Recurrence to the Rescue: Towards Causal Spatiotemporal Representations

by Gurkirt Singh et al.
Oxford Brookes University

Recently, three dimensional (3D) convolutional neural networks (CNNs) have emerged as dominant methods to capture spatiotemporal representations, by adding to pre-existing 2D CNNs a third, temporal dimension. Such 3D CNNs, however, are anti-causal (i.e., they exploit information from both the past and the future to produce feature representations, thus preventing their use in online settings), constrain the temporal reasoning horizon to the size of the temporal convolution kernel, and are not temporal resolution-preserving for video sequence-to-sequence modelling, as, e.g., in spatiotemporal action detection. To address these serious limitations, we present a new architecture for the causal/online spatiotemporal representation of videos. Namely, we propose a recurrent convolutional network (RCN), which relies on recurrence to capture the temporal context across frames at every level of network depth. Our network decomposes 3D convolutions into (1) a 2D spatial convolution component, and (2) an additional hidden state 1×1 convolution applied across time. The hidden state at any time t is assumed to depend on the hidden state at t-1 and on the current output of the spatial convolution component. As a result, the proposed network: (i) provides flexible temporal reasoning, (ii) produces causal outputs, and (iii) preserves temporal resolution. Our experiments on the large-scale "Kinetics" dataset show that the proposed method achieves superior performance compared to 3D CNNs, while being causal and using fewer parameters.




1 Introduction

Convolutional neural networks (CNNs) are starting to exhibit gains in action recognition from videos similar to those previously observed in image recognition [21, 36], thanks to new 3D CNNs [3, 52, 45, 11, 51]. For instance, Hara et al[11] have shown that this is the case for the 3D version of 2D residual networks (ResNets) [12]. Other recent works [52, 45] show that 3D convolutions can be decomposed into 2D (spatial) and 1D (temporal) convolutions (yielding the S3D architecture), and that these separated convolution operators not only have fewer parameters to train [52], but also perform better than full 3D (spatiotemporal) convolutions.

All such 3D CNNs, however, have significant issues: firstly, they are not causal [2], for they process future frames to predict the label of the present frame. Causal inference is essential for many problems in video understanding, e.g., online action detection [38, 40], future action label prediction [20], and future representation prediction [47].
Secondly, the temporal convolution size needs to be picked by hand at every level of network depth, and is usually set to be equal to the spatial convolution size [3, 45] in all state-of-the-art 3D networks [3, 2, 52, 44]. Whatever the choice, the temporal reasoning horizon or ‘receptive field’ is effectively constrained by the size of the temporal convolution kernel(s). Varol et al[46] suggest that using long-term temporal convolutions could help increase the receptive field and enable long-term reasoning. However, fixing the size of the temporal convolution kernel at each level is a non-trivial task which requires expert knowledge.
Lastly, 3D CNNs do not preserve temporal resolution, as the latter drops with network depth. Preserving temporal resolution, by contrast, is essential in problems where predictions need to be made on each frame of the input clip while reasoning about temporal context, e.g., bounding box regression on each frame for action tube detection [9, 38], temporal label prediction on each frame for temporal action segmentation [34, 30], or online video segmentation [54].
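To make the receptive-field limitation concrete, the temporal horizon of a stack of temporal convolutions can be computed with a few lines of Python (a generic receptive-field calculation under stated assumptions, not code from the paper):

```python
def temporal_receptive_field(num_layers, kernel_size, stride=1):
    """Temporal receptive field (in frames) of a stack of 1D temporal
    convolutions: each layer adds (kernel_size - 1) * jump frames of context."""
    rf, jump = 1, 1  # jump = distance between adjacent outputs, in input frames
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Five stacked size-3 temporal convolutions (stride 1) see only 11 frames:
print(temporal_receptive_field(5, 3))  # -> 11
```

The horizon grows only linearly with depth and kernel size (unless strides sacrifice temporal resolution), which is exactly the constraint recurrence removes.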

Our method: combining implicit and explicit temporal modelling. An alternative to the implicit modelling of a video’s temporal characteristics via 3D CNNs is the use of models which encode these dynamics explicitly. Hidden state models, such as Markov models [1], recurrent neural networks (RNNs) [15, 8], and long short-term memory (LSTM) networks [13], can all be used to model temporal dynamics in videos [7, 28], allowing flexible temporal reasoning.
In an approach which aims to combine the representation power of explicit dynamical models with the discriminative power of 3D networks, in this work we propose a recurrent alternative to 3D convolution, illustrated in Figure 1(c). In this new architecture, spatial reasoning, expressed in the form of a mapping from the input to a hidden state, is performed by 2D spatial convolution, whereas temporal reasoning (represented by hidden state-to-hidden state transformations) is performed by a 1×1 convolution taking place at every point in time and at each level (depth-wise) of the network. In a setting which combines the effect of both operators, the hidden state at time t (denoted by h_t) is a function of both the output of the spatial convolution and of the output of the temporal (hidden) convolution with h_{t-1} as input. As a result, the temporal reasoning horizon is effectively unconstrained, as the hidden state is a function of the input over the whole interval [0, t].

Figure 1: Illustration of 3D architectures used on sequences of input frames. (a) Standard 3D convolution, as in I3D [3] or C3D [44]. (b) 3D convolution decomposed into a 2D spatial convolution followed by a 1D temporal one, as in S3D [52]. In R(2+1)D [45] the number of middle planes is increased to match the number of parameters in standard 3D convolution. (c) Our proposed decomposition of 3D convolution into 2D spatial convolution and recurrence (in red) in the temporal direction, with a 1×1 convolution as the hidden state transformation.

Causality. Unlike 2D CNNs [35], which are both causal and preserve temporal resolution but do not perform temporal reasoning using the context provided by neighbouring frames, Carreira et al[2] propose to address causality by means of a sequential causal 3D network which uses present and past frames to predict the action class at the present frame. As an alternative, they suggest utilising the temporal information flow at different CNN depth levels by predicting labels based on the top-level representations from past frames. However, performance drops in both cases compared to the counterpart 3D CNN version [3] of the same network [2].
Our proposed method, by contrast, solves both problems via a recurrent convolutional network (RCN) which explicitly performs temporal reasoning at each level of the network thanks to recurrence, while maintaining temporal resolution and being causal, without a decline in performance.

Transfer learning and initialisation. The ability of a network to exploit knowledge acquired by solving other tasks (transfer learning) has been proved crucial to performance. Famously, when Tran et al[44] first proposed 3D CNNs for video action recognition, their observed performance turned out to be merely comparable to that of 2D CNNs [35], e.g., on the Sports1M dataset [17]. For these reasons, Carreira et al[3] later proposed to use transfer learning to boost 3D CNN performance. There, 2D CNNs are inflated into 3D ones by replacing 2D convolutions with 3D convolutions: as a result, 2D network weights pre-trained on ImageNet can be used to initialise their 3D CNNs. Indeed, a rich variety of pre-trained 2D models is available in multiple deep-learning frameworks, e.g., PyTorch or TensorFlow. This makes the use of 3D CNNs more widely accessible, for training a full 3D CNN from scratch is a computationally expensive task: 64 GPUs were used to train the latest state-of-the-art 3D CNNs [3, 45, 2], which is a big ask for smaller research groups. That makes ImageNet-based initialisation even more crucial for speeding up the training process of parameter-heavy 3D networks.

Unlike Tran et al[45], where the number of filters changes, our recurrent convolutional network exhibits performance gains from ImageNet initialisation similar to those of inflated 3D CNNs (I3D) [3]. Interestingly, Le et al[22] show that simple RNNs can exhibit long-term memory properties if appropriately initialised, even better than LSTMs. They use ReLU [10] activation functions because of their fast convergence and sparsity properties [10], as opposed to what happens with traditional RNNs, and in line with standard practice in CNNs. We thus follow [22] and initialise our hidden-to-hidden convolution (of kernel size 1×1) with the C×C identity matrix, where C is the number of hidden state kernels. Spatial convolutions, instead, can simply be initialised using ImageNet pre-trained weights.


In summary, we present a new approach to video feature extraction based on an original convolutional network with recurrent hidden states at each depth level, which:

  • allows flexible temporal reasoning, exploiting information coming from the whole input sequence observed up to time t;

  • generates output representations in a causal way, allowing online video processing and enabling the use of 3D networks in scenarios in which causality is key;

  • preserves temporal resolution to produce predictions for each frame, e.g. segmentation [30, 34, 54];

  • is designed to directly benefit from model initialisation via ImageNet pre-trained weights, as opposed to state-of-the-art approaches, and in line with clear emerging trends in the field.

In our experiments we show that our proposed RCN outperforms baseline I3D models, while displaying all the above desirable properties.

2 Related work

Since the two-stream 2D CNNs proposed by Simonyan et al[35] produced performances comparable to those of traditional features such as IDT [48, 49], HOG [4], HOG3D [19] and HOF [5], 2D features have been extensively used in action recognition and detection. Efforts have also been made to capture more temporal information. For instance, Donahue et al[7, 55] used LSTMs on top of 2D CNN features. Wang et al[50] proposed to train 2D CNNs with segment-level inputs. Other approaches include, among others, CNN features in combination with LSTMs [24] for temporal action detection, 2D features used in an encoder-decoder setup along with temporal convolutions [30], and conditional random fields on series of 2D features [33] for temporal action detection and recognition. All these methods show promising results; in all of them, however, the optical flow stream and the few layers on top of the 2D features are the only sources of temporal reasoning.

Later, the 3D CNNs proposed by Ji et al[14] and Tran et al[44] (the C3D architecture) promised to be able to perform spatial and temporal reasoning at the same time. However, the lack of proper initialisation and of sufficient training data was crippling. Carreira et al[3] thus proposed to address both problems by inflating 2D CNNs into 3D CNNs. They used the weights of 2D CNNs pre-trained on ImageNet [6] to initialise 3D networks, and trained the latter on the large-scale Kinetics dataset [18]. The resulting performance was beyond that of 2D models. These models, however, remain heavy and very expensive to train – e.g., 64 GPUs were used in [3].

As an alternative, the notion of factorising 3D convolutional networks was explored by Sun et al[42]. This inspired [52, 29, 45] to decompose 3D convolutions into 2D (spatial) and 1D (temporal) convolutions. Recent work by Xie et al[52] has shown how to reduce complexity (represented by the number of parameters) while making up for the lost performance via a gating mechanism. Tran et al[45] instead keep the number of parameters equal to that of 3D convolutions, but boost performance by increasing the number of kernels in the 2D layers. The size of the temporal convolution kernel needs to be fixed to a relatively small number (e.g., 3 in [3, 52, 45]). Varol et al[46] have thus proposed the use of long-term convolutions to capture long-term dependencies in the data. Wang et al[51], instead, have introduced non-local blocks in existing 3D CNN architectures, in order to capture the non-local context in both the spatial and the temporal (present, past and future) dimensions.

The use of temporal convolutions in all the above methods is, however, inherently anti-causal. Moreover, temporal context is limited by the size of the temporal convolution kernel or of the non-local step. Also, temporal resolution is not preserved in temporal convolutions with strides: to address this problem, Shou et al[32] use temporal deconvolution layers on top of the C3D network [44] to produce a one-to-one mapping from input frames to the corresponding frame-level label predictions for temporal action detection.

Relevantly, Carreira et al[2] have recently proposed to address the anti-causal nature of 3D CNNs by predicting the future and utilising the flow of information at different depth levels in the network. They train their causal network to mimic a 3D network – however, the resulting performance drop is significant.
We propose to solve all the above described problems with 3D CNNs (causality, long-term dependencies, temporal resolution) thanks to our proposed Recurrent Convolutional Network (RCN). Recurrence allows our RCN to model dependencies on longer time scales, while temporal reasoning is performed only over the past and an output is produced at every time step. In addition, our RCN uses fewer parameters than any existing 3D CNN architecture, but still performs better than I3D networks.

Recurrent convolutions have indeed been tried for image generation [16, 26], scene labelling [27], scene text recognition [31] and video representations [53]. In particular, the convolutional LSTM (C-LSTM) proposed in [53] for precipitation forecasting is closely related to our work. Its authors proposed to use a network made of convolutional LSTMs, whereas we use 2D convolutions for spatial reasoning and an additional 1×1 convolution, applied in a recurrent fashion, for temporal reasoning. C-LSTM has been applied to videos [23] to capture spatial attention over time on top of 2D feature maps. However, its performance has turned out to be sub-par compared to that of 2D CNNs.

Recurrent models provide the benefit of being causal and of preserving temporal resolution. Our Recurrent Convolutional Network exploits both these benefits, along with the 3D CNN philosophy and wisdom.

3 2D to 3D CNNs

There are two main reasons why 3D CNNs [3, 45, 52, 51] that evolved from 2D CNNs [36, 50] perform better than 3D CNNs built from scratch [44, 14]. Firstly, 2D CNNs are well tried and tested on the problem of image recognition, and a video is, after all, a sequence of images – hence, transferability makes sense. Secondly, initialisation from a good starting guess leads to better convergence on videos [3], since the number of parameters in 3D networks is huge.
In this section we recall the two basic types of 3D CNNs that can be built using a 2D CNN architecture. We will use them as baselines in our experiments.

3.1 Inflated 3D Network

A 2D network can be converted/inflated into a 3D one by replacing each 2D (k × k) convolution with a 3D (k × k × k) convolution, as shown in Figure 1(a). Usually, the kernel’s temporal dimension is set to be equal to the spatial dimension k, as in the inflated 3D network (I3D) [3] or the convolutional 3D network (C3D) [44].
Here, we inflate the 18-layer ResNet [12] network into an I3D one as shown in Table 1, where each 2D convolution is inflated into a 3D convolution. Similarly to the I3D network in [3], a convolutional layer is used for classification, instead of the fully connected layer used in [45, 51]. A convolutional classification layer allows us to evaluate the model on sequences of variable length at test time. In this way, video-level results can be obtained seamlessly, as opposed to computing clip-level outputs in a sliding-window fashion to obtain a video-level output.
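The inflation rule of [3] can be sketched as follows (a minimal NumPy illustration; the function name and example shapes are ours): each 2D kernel is replicated t times along the temporal axis and rescaled by 1/t, so that the inflated filter's response to a static (repeated-frame) video matches that of the original 2D filter.

```python
import numpy as np

def inflate_2d_kernel(w2d, t):
    """Inflate a 2D kernel of shape (out, in, k, k) into a 3D kernel of shape
    (out, in, t, k, k): replicate along time and divide by t, so the response
    to a static (repeated-frame) video equals that of the 2D filter."""
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t

w2d = np.random.randn(64, 3, 7, 7)   # e.g. a first-layer ResNet kernel
w3d = inflate_2d_kernel(w2d, 3)
assert w3d.shape == (64, 3, 3, 7, 7)
# Summing the inflated kernel over the temporal axis recovers the 2D kernel:
assert np.allclose(w3d.sum(axis=2), w2d)
```

This is what allows ImageNet pre-trained 2D weights to serve as a sensible starting point for the 3D network.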

Layers | Output sizes
Names (temporal size, spatial size, number of kernels) | I3D | RCN
conv1 (strided)
res2
res3
res4
res5
pool (spatial pooling)
convC (classification)
mean (temporal pooling)
Table 1: I3D ResNet-18 model architecture, with its output sizes and the corresponding RCN output sizes for the same input size. Each convolutional layer of the network is defined by the temporal and spatial sizes of its kernel and by its number of kernels. The convC layer uses the number of classes as its number of kernels.

3.2 Separated Convolution Networks

Figure 1(b) shows how a 3D (t × k × k) convolution can be decomposed into a (1 × k × k) spatial convolution and a (t × 1 × 1) temporal one. Usually, the size t of the temporal kernel is set to be equal to the spatial dimension k, as in both I3D [3] and C3D [44].
Such a separated convolutional network (S3D) was introduced by Xie et al[52]. The authors showed that such a decomposition not only reduces the number of parameters, but also delivers performance very similar to that of traditional 3D CNNs. After taking a closer look at 3D convolution separation, Tran et al[45] argued that if the number of kernels used in the spatial convolution (Figure 1(b)) is increased in such a way that the number of parameters of the spatial and temporal convolutions combined equals the number of parameters of the 3D convolution, then performance actually improves over 3D networks. However, such a change in the number of kernels no longer allows initialisation from ImageNet pre-trained models. They refer to their model as the R(2+1)D model. Although the latter can be considered a special case of Pseudo-3D networks (P3D) [29], because of its homogeneity and simplicity the (2+1)D model performs better than P3D.

We re-implemented (2+1)D without ImageNet initialisation as an additional baseline, as it achieves the most promising results without any additional trick such as the gating used in S3D.
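For reference, the parameter-matching rule of R(2+1)D [45] picks the number of intermediate channels M so that the factorised block has roughly the parameter count of the full 3D convolution. A sketch (our function names; bias terms ignored):

```python
from math import floor

def midplanes(n_in, n_out, t=3, d=3):
    """Intermediate channel count M used by R(2+1)D so that a (2+1)D block
    has roughly as many parameters as a full t x d x d 3D convolution."""
    return floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))

def params_3d(n_in, n_out, t=3, d=3):
    return n_in * n_out * t * d * d          # full 3D convolution

def params_2plus1d(n_in, n_out, t=3, d=3):
    m = midplanes(n_in, n_out, t, d)
    return n_in * m * d * d + m * n_out * t  # spatial conv, then temporal conv

# For a 64 -> 64 channel block the two parameter counts match:
print(params_3d(64, 64), params_2plus1d(64, 64))
```

Because M depends on the input/output channel counts, the resulting spatial layers no longer match any 2D ResNet, which is why R(2+1)D forgoes ImageNet initialisation.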

4 Recurrent Convolutional 3D Network

In this section, we describe the architecture of our Recurrent Convolutional (3D) Network (RCN) and its properties in detail. Firstly, we show how Recurrent Convolutional Units (RCUs) (§ 4.1) are used to replace 3D convolutions in the I3D network (§ 3.1), resulting in our RCN model (§ 4.2). Next, in § 4.3 we show how RCUs preserve temporal resolution. Then, in § 4.4, we show how our network behaves in a causal manner and can capture longer temporal dependencies than 3D convolutions. Lastly, in § 4.5 and § 4.6, we illustrate the initialisation process for the 2D layers and the hidden state layers, respectively.

4.1 Recurrent Convolutional Unit

A pictorial diagram of our proposed recurrent convolutional unit (RCU) is given in Figure 1(c). The input x_t at any time instant t passes through a spatial convolution (a 3D convolution with temporal kernel size 1), with kernel denoted by w_s. The result is added to the output of a recurrent convolution operation, with kernel denoted by w_h, of size 1×1, applied to the previous hidden state h_{t-1}.
The result is termed the hidden state h_t of the unit. Analytically, a recurrent convolutional unit can be described by the following relation:

h_t = w_s * x_t + w_h * h_{t-1},

where w_s and w_h are the parameters of the RCU, and * represents the convolution operator.
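The recurrence can be sketched as follows (a toy NumPy illustration with the spatial dimensions collapsed to a channel vector, a zero initial hidden state, and a ReLU folded into the unit for brevity; the names w_s and w_h follow the text, everything else is ours):

```python
import numpy as np

def rcu_forward(x_seq, w_s, w_h):
    """Unrolled recurrent convolutional unit over a clip:
    h_t = ReLU(w_s @ x_t + w_h @ h_{t-1}), starting from h_0 = 0.
    Returns one hidden state per frame, so temporal resolution is preserved."""
    h = np.zeros(w_s.shape[0])
    outputs = []
    for x_t in x_seq:
        h = np.maximum(0.0, w_s @ x_t + w_h @ h)   # spatial + recurrent terms
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))            # 16 frames, 8 input channels
w_s = rng.standard_normal((32, 8)) * 0.1    # "spatial" (channel-mixing) weights
w_h = np.eye(32)                            # identity-initialised hidden weights
out = rcu_forward(x, w_s, w_h)
assert out.shape == (16, 32)                # one output vector per input frame
```

In the full network w_s is a spatial convolution over feature maps and w_h a 1×1 convolution over channels, but the dependency structure is exactly the one above.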

4.2 Recurrent Convolutional Network

Figure 2 represents a simple recurrent convolutional network (RCN) composed of a single RCU unit, unrolled up to time t. At each time step t, an input x_t is processed by the RCU and the other layers to produce an output o_t.

The unrolling principle allows us to build an RCN from 2D/3D networks, e.g. by replacing 3D convolutions with RCUs in any I3D network. Indeed, the network architecture of our proposed model builds on the I3D network architecture shown in Table 1, where the same parameters (spatial kernel size, number of kernels) used for 3D convolutions are used for our RCUs. Unlike I3D, however, our RCN does not require a temporal convolution size (cf. Table 1) as a parameter. As in 2D or I3D ResNet models [12, 45, 11], our proposed RCN also has residual connections. The initial hidden state h_0, as shown in Figure 2, is initialised by the output of the bottom 2D convolution layer at the initial time step. The hidden state h_t at time t is considered to be the output at that time instant – as such, it acts as input both to the next hidden state and to the whole next depth-level layer. Table 1 describes the network architecture of ResNet-18, i.e., a residual network [12] with 18 layers. Similarly, we can build upon other variants of ResNet. In particular, in this work we use ResNet-34 to generate comparisons with state-of-the-art approaches.

Figure 2: An unrolled recurrent convolutional network (RCN) composed of a single RCU layer followed by a batch normalisation (BN) layer, a ReLU activation layer, and a final convolutional layer used for classification.

4.3 Temporal Resolution Preservation

The output sizes for both I3D and our proposed RCN are shown in Table 1. Our RCN only uses spatial pooling and a convolutional layer for classification, unlike the spatiotemporal pooling of [3, 51, 45]. As Table 1 shows, unlike I3D, RCN produces 16 classification score vectors for an input sequence of length 16.

This one-to-one mapping from input to output is essential in many tasks, ranging from temporal action segmentation [32, 30], to temporal action detection [37], to action tube detection [38]. In all such tasks, video-level accuracy is not enough: we need frame-level results, e.g., in terms of detection bounding boxes and class scores. Temporal convolution behaves in a similar way to spatial convolution: as the depth of the network increases, it produces feature maps of lower temporal resolution than the input.

Unlike the temporal deconvolution proposed in  [32, 30], our RCN inherently addresses this problem (see Table 1).

4.4 Causality and Longer-Term Dependencies

A temporal convolution of size k generates an output at time t from a window of k input frames centred at t, thus drawing on future frames.
In the case of our recurrent convolutional network, instead, the output at time t is a function only of the inputs from the present and the past (up to the initial time step), as shown in Figure 2. Its independence from future inputs makes the output at time t causal. Thus our RCN as presented here is not only causal, but poses no constraints on the modelling of temporal dependencies (as opposed to the upper bound of k in the case of temporal convolutions). Temporal dependencies are only limited by the length of the input sequence at training time.

As in traditional RNNs, we have the option to unroll the same network to model arbitrary input sequence lengths at test time, thus further increasing the possible horizon of temporal dependencies.
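Causality can also be checked empirically: in a recurrent model, perturbing future inputs must leave earlier outputs untouched. A toy sketch (our own minimal recurrence, not the full RCN):

```python
import numpy as np

def causal_recurrence(x_seq, w):
    """Toy recurrent model: the output at time t depends only on x_0..x_t."""
    h = np.zeros(w.shape[0])
    outs = []
    for x_t in x_seq:
        h = np.tanh(w @ h + x_t)
        outs.append(h.copy())
    return np.stack(outs)

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4)) * 0.3
x = rng.standard_normal((10, 4))

out_a = causal_recurrence(x, w)
x_perturbed = x.copy()
x_perturbed[7:] += 100.0                 # change only the "future" frames
out_b = causal_recurrence(x_perturbed, w)

# Outputs before the perturbation point are bit-for-bit identical...
assert np.array_equal(out_a[:7], out_b[:7])
# ...while later outputs change, confirming purely past-dependent behaviour:
assert not np.allclose(out_a[7:], out_b[7:])
```

A temporal convolution centred at t would fail this test: perturbing frames within its look-ahead window changes the output at t.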

4.5 ImageNet Initialisation for the 2D Layers

The I3D model proposed by Carreira et al[3] greatly owes its success to a good initialisation from 2D models trained on ImageNet [6]. By inflating these 2D models, we can benefit from their ImageNet pre-trained weights, as in most state-of-the-art 3D models [3, 52, 51]. We follow the same principle and initialise all 2D layers using the weights of available pre-trained 2D ResNet models [12]. It is noteworthy that the other state-of-the-art model, the (2+1)D model by Tran et al[45], cannot instead exploit ImageNet initialisation, because of the change in the number of kernels.

4.6 Identity Initialisation for the Hidden Layers

The presence of a hidden state convolution layer (w_h, see Figure 2) at every depth level of the unrolled network makes initialisation a tricky issue. The random initialisation of the hidden state convolution component could destabilise the norm of the feature space between two 2D layers. In response to a similar issue, Le et al[22] presented a simple way to initialise RNNs when used with ReLU [10] activation functions. Most state-of-the-art 2D models [12, 43] indeed use ReLU as the activation function of choice for fast and optimal convergence [10].

Following the example of Le et al[22] and others [25, 39], we initialise the weights of the hidden state convolution kernel (w_h) with the identity matrix. Identity matrix initialisation has been shown [22, 25] to help capture longer-term dependencies. It also helps induce forgetting capabilities in recurrent models, unlike traditional RNNs.

In our experiments, we initialise all hidden state convolution kernels to the identity matrix.
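Concretely, identity initialisation makes the 1×1 hidden-state convolution an initial no-op on h_{t-1}. A hypothetical NumPy sketch (our function names):

```python
import numpy as np

def identity_init_1x1(num_kernels):
    """Identity initialisation of a 1x1 hidden-state convolution: a
    (C_out, C_in, 1, 1) kernel that initially copies each channel through."""
    w = np.zeros((num_kernels, num_kernels, 1, 1))
    w[np.arange(num_kernels), np.arange(num_kernels), 0, 0] = 1.0
    return w

def conv1x1(w, h):
    """Apply a 1x1 convolution (C_out, C_in, 1, 1) to a (C_in, H, W) map."""
    return np.einsum('oi,ihw->ohw', w[:, :, 0, 0], h)

w = identity_init_1x1(16)
h = np.random.randn(16, 7, 7)
# At initialisation, the hidden-state transform is a no-op on h_{t-1}:
assert np.allclose(conv1x1(w, h), h)
```

Starting from a pass-through recurrence preserves the norm of the features flowing between the ImageNet-initialised 2D layers, which is exactly the stability argument above.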

5 Experiments

In this section, we evaluate our recurrent convolutional network on the challenging Kinetics [18] and UCF101 [41] datasets, to study its various original features and compare it with state-of-the-art 3D CNN models. We used sequences of RGB frames as input in all our experiments.
The Kinetics dataset comprises 400 classes and more than 300,000 videos; each video contains a single atomic action. Kinetics has become a de facto benchmark for recent action recognition works [2, 52, 45, 3]. The average duration of a video clip in Kinetics is around 10 seconds.
The UCF101 dataset has 101 classes and 13,320 videos; nowadays, it is used to evaluate the action recognition [41] and transfer learning [3] capabilities of 3D CNNs.

5.1 Training Setup

Different training setups may lead to different convergence points. E.g., the ResNet-18-based I3D model trained by Tran et al[45] is better in terms of final video accuracy as compared to a similar model trained by Hara et al[11]. As shown in Table 2, both models (first two rows) are trained on 16 frames from the input training clip.

The main hyperparameters involved in the training of a 3D network are the learning rate, the batch size, and the number of iterations. These parameters are interdependent, and their optimal setting depends on the computational power at one's disposal. For instance, Tran et al[44] used 64 GPUs with a learning rate of 0.01, with the training process distributed across multiple machines. In such cases, when vast computational resources are available [44, 3, 2], training takes 10-15 hours [45], allowing the time to locate the optimal parameters. The availability of such computational power, however, is scarce. Another important aspect of the training process is the presence of various input data augmentation operations, e.g., random crops, horizontal flips, image intensity jittering and temporal jittering.

A maximum of 4 GPUs was used in our training process. We used the ResNet model as the backbone architecture for our experiments. We therefore worked to reproduce the results in [45] for I3D and their proposed (2+1)D model using our training setup, for a fair comparison, as shown in Tables 2 and 3. We used a batch size of 64, with 8-frame and 16-frame clips as training samples. The learning rate was reduced in steps from its initial value during training, and training was stopped after a fixed number of iterations (number of batches).

We used 8- or 16-frame-long RGB clips to train our RCN and the baseline I3D and (2+1)D models. As for data augmentation, we used random crops and horizontal flips with 50% probability, along with mean subtraction and normalisation by the standard deviation. We used the PyTorch framework to implement our training and evaluation setups.

Evaluation: For a fair comparison, we computed clip-level and video-level accuracy in exactly the same way as described in [45]. Ten clips were evaluated per video, and average scores were used for video-level classification.
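The clip-to-video protocol can be sketched as follows (toy scores; our function name):

```python
import numpy as np

def video_prediction(clip_scores):
    """Video-level label from per-clip class scores: average the clip score
    vectors (ten per video, following [45]), then take the argmax."""
    return int(np.mean(clip_scores, axis=0).argmax())

# Ten clips, four classes (toy scores): class 1 wins on average.
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.2, 0.5, 0.2, 0.1]] * 5)
assert video_prediction(scores) == 1
```

Averaging scores before the argmax smooths over clips where the action is briefly ambiguous, which is why video-level accuracy is consistently higher than clip-level accuracy in Table 2.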

Network #Params Initialisation Clip % Video %
I3D [11] 33.4M random 54.2
I3D [45] 33.4M random 49.4 61.8
(2+1)D [45] 33.3M random 52.8 64.8
I3D 33.4M random 47.4 59.8
RCN [ours] 12.8M random 49.3 61.8
(2+1)D 33.4M random 50.5 62.6
I3D 33.4M ImageNet 51.2 63.3
RCN [ours] 12.8M ImageNet 53.0 65.1
trained with our implementation and training setup.
Table 2: Clip-level and video-level action recognition accuracy on the validation set of the Kinetics dataset for different ResNet-18-based models, trained using 8-frame-long clips as input.

5.2 Comparison with Baseline I3D and (2+1)D

As mentioned, we re-implemented the I3D and (2+1)D models using ResNet-18 and ResNet-34 as backbones. The ResNet-18 I3D architecture is presented in Table 1. Based on the latter, we built a (2+1)D [45] architecture in which we matched the number of parameters of the separated convolutions to that of standard 3D convolutions, as explained in [45]. We trained these models in the 2-GPU setup explained above, and trained our proposed RCN using the same settings.

The results of the I3D and (2+1)D implementations reported by Tran et al[45] are shown in the top half of Table 2. When comparing them with our implementations of the same networks in the bottom half, it is clear that our training is suboptimal compared to that of Tran et al[45] – the likely reason being that we train on only 2 GPUs rather than 64 (in addition, in [45] almost twice as many training iterations were run). As a result, we also use different batch sizes. Moreover, the input data augmentations applied in the two cases are different. Our optimisation process is still far better than that of [11], as shown in line 1 of Table 2.

This also suggests that additional resources would likely lead to a further 2-3% performance improvement.

The number of parameters in our proposed RCN model is 12.8 million (12.8M), as opposed to 33.4M in both the I3D and (2+1)D models (see Table 2). It is remarkable that, despite a 2.6× reduction in the number of parameters, RCN still outperforms both I3D and (2+1)D when trained using ImageNet initialisation. Further, RCN surpasses I3D also under random initialisation, again while using 2.6× fewer model parameters. Similarly, Table 3 shows that RCN outperforms both the I3D and (2+1)D models when the base model is ResNet-34 and the input clip length is 16, once more while using far fewer parameters.

Model #Params Initialisation Acc%
(2+1)D 63.7M random 67.0
I3D 63.7M ImageNet 67.6
RCN [ours] 24.1M ImageNet 69.5
trained with our implementation and training setup.
Table 3: Comparison of our RCN with state-of-the-art I3D and (2+1)D models on the validation set of Kinetics using a ResNet-34 architecture trained using 16-frame-long clips.

ImageNet initialisation proves to be useful for both the I3D and our RCN models. While (2+1)D (Table 2, row 6) performs better than RCN (row 5) under random initialisation, our RCN overtakes (2+1)D once ImageNet initialisation is used (row 8 vs row 6), an initialisation (2+1)D cannot take advantage of. This seems to be a severe drawback for the (2+1)D model, and a big advantage for I3D and RCN. One may argue that, if the purpose is to build on existing 2D models, then RCN and I3D are the better choice, whereas if new 3D models are to be trained from scratch then (2+1)D might prove useful. The latter does provide better performance with random initialisation, but at the price of requiring many more parameters than RCN.

Random initialisation for hidden state parameters resulted in unstable training. Thus, in all the experiments with RCN, we used identity matrix initialisation instead.

5.3 Comparison with Other Causal Networks

A comparison with other 3D causal networks is a must, as we claim the causal nature of the network to be one of the main contributions of our work, making RCN best suited to online applications such as action detection and prediction.

Carreira et al. [2] proposed two causal variants of the I3D network. Their sequential version, however, shows a drop in performance compared to I3D, as seen in lines 1 and 2 of Table 4. Their parallel version is much faster than the sequential one, but suffers an even larger performance decline (see line 3 of Table 4).

Model Clip-length Acc%
InceptionV1-I3D [2] 64 71.8
InceptionV1-I3D-seq [2] 64 71.1
InceptionV1-I3D-par [2] 64 54.5
ResNet34-I3D 16 67.6
ResNet34-RCN [ours] 16 69.5
causal model, unlike respective I3D versions
trained with our implementation and training setup.
Table 4: Comparison between RCN and other causal models on the Kinetics validation set.

As to our new model, RCN consistently outperforms I3D in all the considered settings, namely: when using ResNet-18 as the base model with an input clip size of 8, with or without ImageNet initialisation (cf. Table 2); and when using ResNet-34 with an input clip size of 16 (Table 4).

It is therefore fair to say that our RCN is the best performing causal network to date, when compared with the corresponding I3D versions.

Model Clip-length Initialisation Acc%
InceptionV1-I3D [3] 64 ImageNet 71.8
InceptionV1-S3D [52] 64 ImageNet 72.2
InceptionV1-S3Dg [52] 64 ImageNet 74.7
InceptionV3-S3Dg [52] 64 ImageNet 77.7
ResNet101-I3D [51] 32 ImageNet 74.4
ResNet101-I3D-NL [51] 32 ImageNet 76.0
ResNet101-I3D-NL [51] 128 ImageNet 77.7
ResNet34-(2+1)D [45] 32 random 72.0
ResNet34-(2+1)D 16 random 67.2
ResNet34-I3D 16 ImageNet 67.6
ResNet34-RCN [ours] 16 ImageNet 69.5
causal model, unlike the rest
trained with our implementation and training setup.
Table 5: Video-level action classification accuracy of different models on the validation set of the Kinetics dataset.

5.4 Comparison with I3D Variants

To present a comprehensive comparison with the state of the art, we think it appropriate to take a closer look at other variants of the I3D model, albeit all anti-causal.

The S3Dg model applies a gating operation to the outputs of Separated 3D (S3D) convolutions [52]. The authors show that gating provides a substantial performance gain compared to I3D (cf. the first part of Table 5).
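For reference, the self-gating idea can be sketched as follows; the module name and exact placement are our assumptions for illustration, not the S3Dg code of [52]:

```python
import torch
import torch.nn as nn

class SelfGating(nn.Module):
    """Channel-wise feature gating (sketch of the S3Dg-style gating idea):
    per-channel weights are computed from spatiotemporally pooled features
    and used to re-scale the feature map."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                   # x: (N, C, T, H, W)
        ctx = x.mean(dim=(2, 3, 4))         # spatiotemporal average pooling
        gate = torch.sigmoid(self.fc(ctx))  # per-channel gate in (0, 1)
        return x * gate.view(*x.shape[:2], 1, 1, 1)
```

Nothing in this operation is specific to S3D: it consumes and produces a feature map of the same shape, which is why it could equally be attached to (2+1)D or RCN features.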

Non-Local (NL) operations, as proposed by Wang et al. [51], are also shown to improve performance over the base I3D model, as reported in the second part of Table 5.

It is worth remarking that gating and NL operations are not constrained to be applied on top of I3D or S3D models: they can also be used in conjunction with (2+1)D and with our own RCN. As this work focuses on comparing our network with other 3D models, we chose I3D and (2+1)D as baselines (Sec. 5.2). From the last three rows of Table 5, we can sensibly speculate that, since RCN performs better than those 3D models, adding gating and NL layers on top of it is likely to yield performance superior to that of [52] and [51].

Input clip length is another factor determining the final performance of any network. We can see from Table 5 that all the Inception-based models are trained on 64-frame-long clips, one of the reasons why Inception networks work better than ResNet models while using fewer parameters. Among the latter, ResNet101-I3D-NL [51] is shown to work even better with a longer input clip. Thus, the evidence supports the view that training on longer input sequences boosts performance, although this clashes with memory limitations.

In our experiments, as mentioned, we stuck to 16-frame clips as input and compared our proposed RCN with the baseline I3D models (bottom rows of Table 5). We think this provides enough evidence in support of our proposal to move away from temporal convolutional networks [52, 51, 45, 3] and replace them with more sophisticated, causal structures. As with additional layers, it is fair to predict that more extensive training on longer clips (32, 64 or 128 frames) has serious potential to take RCN beyond the state of the art in absolute terms.

5.5 Comparison on UCF101

The last three rows of Table 6 show the performance of the baseline I3D, (2+1)D and RCN models on UCF101. As on Kinetics, our RCN (last row) outperforms our baseline implementations of I3D and (2+1)D.

The second and third parts of Table 6 report the state-of-the-art results achieved by the S3Dg [52] and (2+1)D models as implemented in [45], respectively. Our implementations of RCN and (2+1)D deliver comparable results, despite being trained on shorter clips (16 versus 32 or 64 frames).

Model Clip Initialisation Acc%
VGG-16 C3D [44] 16 Sports-1M 82.3
Resnet101 P3D [29] 16 Sports-1M 88.6
InceptionV1-I3D [3] 64 ImageNet+Kinetics 95.6
InceptionV1-S3Dg [52] 64 ImageNet+Kinetics 96.8
ResNet34-(2+1)D [45] 32 Sports1M 93.6
ResNet34-(2+1)D [45] 32 Kinetics 96.8
ResNet34-(2+1)D 16 Kinetics 93.1
ResNet34-I3D 16 ImageNet+Kinetics 93.3
ResNet34-RCN [ours] 16 ImageNet+Kinetics 93.6
trained with our implementation and training setup.
Table 6: Video-level action classification accuracy on the UCF101 dataset (averaged over 3 splits).
Network Clip-length Initialisation Acc %
RCN [ours] 8 random 61.8
RCN [ours] 16 random 64.8
RCN [ours] 8 ImageNet 65.1
Table 7: Video-level action recognition accuracy on the Kinetics validation set for different ResNet-18-based RCN models.

6 Discussion

6.1 On the Efficient Training of 3D Networks

Two basic things are clear from our experience with training heavy 3D models (I3D, (2+1)D, RCN) on large-scale datasets such as Kinetics. Firstly, training is computationally very expensive and memory-heavy; secondly, longer input clips are crucial for better optimisation, which, however, renders the first issue even more severe. How to train these models efficiently is still a wide-open problem, whose solution is essential to speed up the process for wider adoption. We observed that ImageNet initialisation does speed up the training procedure, and helps reach a local minimum much more quickly. For both I3D and RCN, ImageNet initialisation improves the video classification accuracy on Kinetics by more than 3% compared to random initialisation under the same number of training iterations, as shown in the first and last rows of Table 7. Furthermore, 16-frame models also exhibit a 3% performance improvement over the corresponding 8-frame models, though at nearly twice the computational cost.
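The reason RCN (like I3D) can exploit free ImageNet initialisation, while (2+1)D cannot, comes down to tensor shapes: RCN's spatial component is a plain 2D convolution, so pre-trained 2D weights copy over directly, whereas the factorised (2+1)D kernels have no shape-matching 2D counterpart. A toy PyTorch sketch, using stand-in layers (and an illustrative intermediate channel count of 144) rather than an actual pre-trained network:

```python
import torch
import torch.nn as nn

# stand-in for a 2D backbone layer (in practice, an ImageNet-pretrained conv)
conv2d = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

# RCN's spatial component has an identical weight shape: the copy is exact
rcn_spatial = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    rcn_spatial.weight.copy_(conv2d.weight)

# a (2+1)D factorisation splits the kernel into a (1,3,3) spatial part and a
# (3,1,1) temporal part with an intermediate channel count: neither weight
# tensor matches the 2D one, so ImageNet weights cannot be reused directly
spatial_21d = nn.Conv3d(64, 144, kernel_size=(1, 3, 3), padding=(0, 1, 1))
temporal_21d = nn.Conv3d(144, 128, kernel_size=(3, 1, 1), padding=(1, 0, 0))
```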

The bottom line is that we should strive for more efficient implementations of 3D models for the sake of their adoption.

6.2 Evolution of Recurrence with Network Depth

To conclude, it is interesting to take a closer look at the statistics of the weight matrices associated with the hidden state at every RCU layer of the RCN network.

In Figure 3 we can see that the mean of the diagonal elements of these matrices increases, and their standard deviation decreases, with the depth of the network. A similar trend can be observed for the matrices as a whole. This means that the matrices get sparser with network depth. We believe this is because the temporal reasoning horizon (receptive field) of any given feature increases with depth, which puts more emphasis on feature learning in the early part of the network. A similar phenomenon is observed in networks based on temporal convolutions [52]. A more in-depth analysis of this fact is conducted in the supplementary material.
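The statistics plotted in Figure 3 can be computed with a few lines; the snippet below is a sketch with a hypothetical helper name, applied here to a stand-in weight rather than a trained checkpoint:

```python
import torch

def hidden_state_stats(weight):
    """Mean/std of the diagonal vs. the full hidden-state matrix.

    `weight` is the (C, C, 1, 1) kernel of a hidden-state 1x1 convolution.
    """
    w = weight.reshape(weight.shape[0], weight.shape[1])  # -> (C, C) matrix
    diag = torch.diagonal(w)
    return (diag.mean().item(), diag.std().item(),
            w.mean().item(), w.std().item())
```

For an identity-initialised layer, for instance, the diagonal mean is 1 and its standard deviation 0; Figure 3 shows how these values evolve after training as a function of depth.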

Figure 3: Mean and standard deviation (Std) of the elements of the weight matrices of the hidden state at every RCU layer in RCN, along with the mean and Std of their diagonal elements.

7 Conclusions

In this work we presented a recurrence-based convolutional network (RCN) able to generate causal spatiotemporal representations while using 2.6 times fewer parameters than its traditional 3D counterparts. RCN preserves temporal resolution, a crucial feature in many applications, and can model long-term temporal dependencies in the data without the need to specify a temporal extent.
The proposed RCN is not only causal in nature and temporal resolution-preserving, but is also shown to outperform the main baseline 3D networks in all fair comparisons run with clip sizes of 8 and 16. We showed that ImageNet-based initialisation is at the heart of the success of 3D CNNs. Indeed, although RCN is recurrent in nature, it can still utilise the weights of a pre-trained 2D network for initialisation.
The causal nature of our recurrent 3D convolutional network opens up manifold research directions, with direct and promising potential application in areas such as online action detection and future event/action prediction.


  • [1] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
  • [2] J. Carreira, V. Patraucean, L. Mazare, A. Zisserman, and S. Osindero. Massively parallel video networks. In Proc. European Conf. Computer Vision, 2018.
  • [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733. IEEE, 2017.
  • [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 886–893 vol. 1, June 2005.
  • [5] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In European conference on computer vision, pages 428–441. Springer, 2006.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
  • [7] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
  • [8] J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  • [9] G. Gkioxari and J. Malik. Finding action tubes. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2015.
  • [10] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011.
  • [11] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pages 18–22, 2018.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [14] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2013.
  • [15] M. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proc. of the Eighth Annual Conference of the Cognitive Science Society (Erlbaum, Hillsdale, NJ), 1986, 1986.
  • [16] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
  • [17] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [18] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [19] A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC 2008-19th British Machine Vision Conference, pages 275–1. British Machine Vision Association, 2008.
  • [20] Y. Kong, Z. Tao, and Y. Fu. Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1481, 2017.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
  • [22] Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
  • [23] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek. Videolstm convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 166:41–50, 2018.
  • [24] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in lstms for activity detection and early detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1942–1950, 2016.
  • [25] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753, 2014.
  • [26] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
  • [27] P. H. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In 31st International Conference on Machine Learning (ICML), number EPFL-CONF-199822, 2014.
  • [28] R. Poppe. A survey on vision-based human action recognition. Image and vision computing, 28(6):976–990, 2010.
  • [29] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5534–5542. IEEE, 2017.
  • [30] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. Temporal convolutional networks for action segmentation and detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [31] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 39(11):2298–2304, 2017.
  • [32] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1417–1426. IEEE, 2017.
  • [33] G. A. Sigurdsson, S. K. Divvala, A. Farhadi, and A. Gupta. Asynchronous temporal fields for action recognition. In CVPR, volume 5, page 7, 2017.
  • [34] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, 2016.
  • [35] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems 27, pages 568–576. Curran Associates, Inc., 2014.
  • [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [37] G. Singh and F. Cuzzolin. Untrimmed video classification for activity detection: submission to activitynet challenge. arXiv preprint arXiv:1607.01979, 2016.
  • [38] G. Singh, S. Saha, M. Sapienza, P. Torr, and F. Cuzzolin. Online real-time multiple spatiotemporal action localisation and prediction. In IEEE Int. Conf. on Computer Vision, 2017.
  • [39] R. Socher, J. Bauer, C. D. Manning, et al. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 455–465, 2013.
  • [40] K. Soomro, H. Idrees, and M. Shah. Predicting the where and what of actors and actions through online action localization. In IEEE Conf. on Computer Vision and Pattern Recognition, 2016.
  • [41] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. Technical report, CRCV-TR-12-01, 2012.
  • [42] L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4597–4605, 2015.
  • [43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [44] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. IEEE Int. Conf. on Computer Vision, 2015.
  • [45] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
  • [46] G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1510–1517, 2018.
  • [47] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.
  • [48] H. Wang, A. Kläser, C. Schmid, and C. Liu. Action Recognition by Dense Trajectories. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2011.
  • [49] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In IEEE Int. Conf. on Computer Vision, pages 3551–3558, 2013.
  • [50] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
  • [51] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [52] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proc. European Conf. Computer Vision, pages 305–321, 2018.
  • [53] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.
  • [54] C. Xu, C. Xiong, and J. J. Corso. Streaming hierarchical video segmentation. In European Conference on Computer Vision, pages 626–639. Springer, 2012.
  • [55] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.