Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition

08/03/2020 ∙ by M. Esat Kalfaoglu, et al. ∙ Middle East Technical University

In this work, we combine 3D convolution with late temporal modeling for action recognition. To this end, we replace the conventional Temporal Global Average Pooling (TGAP) layer at the end of 3D convolutional architectures with the Bidirectional Encoder Representations from Transformers (BERT) layer, in order to better utilize the temporal information with BERT's attention mechanism. We show that this replacement improves the performance of many popular 3D convolution architectures for action recognition, including ResNeXt, I3D, SlowFast and R(2+1)D. Moreover, we provide state-of-the-art results on both the HMDB51 and UCF101 datasets with 83.99% and 98.65% top-1 accuracy, respectively. The code is publicly available.


1 Introduction

Action Recognition (AR) pertains to identifying the label of the action observed in a video clip. With cameras everywhere, AR has become essential in many domains, such as video retrieval, surveillance, human-computer interaction and robotics.

A video clip contains two critical pieces of information for AR: Spatial and temporal information. Spatial information represents the static information in the scene, such as objects, context, entities etc., which are visible in a single frame of the video, whereas temporal information, obtained by integrating the spatial information over frames, mostly captures the dynamic nature of the action.

In this work, the joint utilization of two temporal modeling concepts from the literature, namely 3D convolution and late temporal modeling, is proposed and analyzed. Briefly, 3D convolution builds temporal relationships hierarchically from the beginning to the end of a CNN architecture. Late temporal modeling, on the other hand, is typically used with 2D CNN architectures, where the features extracted from selected frames by a 2D CNN are modeled with recurrent architectures, such as LSTMs or convolutional LSTMs.

Despite its advantages, the temporal global average pooling (TGAP) layer used at the end of all 3D CNN architectures [1, 2, 7, 11, 20, 25, 26, 33] hinders the richness of the final temporal information. The features before TGAP can be considered as features of different temporal regions of a clip or video. Although their receptive fields might cover the whole clip, these features are produced by focusing on different temporal regions of it. In order to discriminate an action, one part of the temporal feature might be more important than the others, or the order of the temporal features might be more informative than a simple average. TGAP ignores this ordering and therefore fails to fully exploit the temporal information.
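
To make this limitation concrete, the following PyTorch-style sketch (hypothetical shapes and variable names, not the authors' code) shows how TGAP collapses the temporal axis of the final 3D CNN feature map, so that any reordering of the temporal positions yields the same clip descriptor:

```python
import torch

# Hypothetical final feature map of a 3D CNN backbone:
# (batch, channels, temporal, height, width)
features = torch.randn(2, 2048, 8, 4, 4)

# Spatial global average pooling keeps one feature vector per temporal position.
temporal_features = features.mean(dim=[3, 4])            # (2, 2048, 8)

# Temporal Global Average Pooling (TGAP) then averages over the temporal axis.
clip_descriptor = temporal_features.mean(dim=2)           # (2, 2048)

# Shuffling the temporal positions gives the same descriptor (up to float rounding),
# i.e. TGAP discards both the ordering and the relative importance of the positions.
shuffled = temporal_features[:, :, torch.randperm(8)]
print(torch.allclose(clip_descriptor, shuffled.mean(dim=2), atol=1e-6))  # True
```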

Therefore, we propose using the attention mechanism of BERT for better temporal modeling than TGAP. BERT not only determines which temporal features are more important with its attention mechanism, but also attends to the order of the temporal information through its positional encoding.

To the best of our knowledge, our work is the first to propose replacing TGAP in 3D CNN architectures with late temporal modeling, and the first to utilize BERT as a temporal pooling strategy in AR. We show that BERT performs better temporal pooling than average pooling, concatenation pooling and a standard LSTM. Moreover, we demonstrate on split 1 of the HMDB51 dataset that late temporal modeling with BERT improves the performance of various popular 3D CNN architectures for AR, namely ResNeXt101, I3D, SlowFast and R(2+1)D. Using the BERT R(2+1)D architecture, we obtain new state-of-the-art results: 83.99% and 98.65% top-1 accuracy on the HMDB51 and UCF101 datasets, respectively.

2 Related Work on Action Recognition

In this section, the AR literature is analyzed in two aspects: (i) temporal integration using pooling, fusion or recurrent architectures and (ii) 3D CNN architectures.

2.1 Temporal Integration Using Pooling, Fusion or Recurrent Architectures

Pooling is a well-known technique to combine various temporal features; concatenation, averaging, maximum, minimum, ROI, feature aggregation techniques and time-domain convolution are some of the possible pooling techniques [10, 19].

Fusion, which is frequently used for AR, is very similar to pooling. Fusion is sometimes preferred over pooling in order to emphasize the location of the pooling in the architecture or to distinguish information coming from different modalities. Late-fusion, early-fusion and slow-fusion models on 2D CNN architectures can be realized by combining temporal information along the channel dimension at various points in the network [14]. As an example, the two-stream fusion architecture in [8] creates a spatio-temporal relationship with an extra 3D convolution layer inserted towards the end of the architecture and fuses information from the RGB and optical flow streams.

Recurrent networks are also commonly used for temporal integration. LSTMs are utilized for temporal (sequential) modeling on 2D CNN features extracted from the frames of a video [19, 5]. For example, VideoLSTM [16] performs this kind of temporal modeling by using a convolutional LSTM with spatial attention. RSTAN [6] implements both temporal and spatial attention on an LSTM, and the attention weights of the RGB and optical flow streams are fused.

2.2 3D CNN Architectures

3D CNNs are networks formed of 3D convolutions throughout the whole architecture. In 3D convolution, filters are three-dimensional, and channels and time are represented as different dimensions. Compared to temporal fusion techniques, 3D CNNs process the temporal information hierarchically and throughout the whole network. Before 3D CNN architectures, temporal modeling was generally achieved by using an additional stream of optical flow or by using temporal pooling layers; however, these methods were restricted to 2D convolution, with temporal information folded into the channel dimension. The downside of 3D CNN architectures is that they incur much higher computational cost and memory demand than their 2D counterparts.
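
As a minimal illustration (hypothetical layer sizes, not taken from any particular architecture), a 3D convolution slides its kernel jointly over the temporal and spatial axes, whereas a 2D convolution applied to a clip has to fold the frames into the channel dimension and loses the explicit temporal axis after the first layer:

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)    # (batch, channels, frames, height, width)

# 3D convolution: the kernel has an explicit temporal extent (3 frames here),
# so temporal relationships are modeled hierarchically, layer by layer.
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
print(conv3d(clip).shape)                  # torch.Size([1, 64, 16, 56, 56])

# 2D alternative: frames are folded into channels (3 * 16 = 48 input channels),
# so the temporal dimension disappears after the very first layer.
conv2d = nn.Conv2d(3 * 16, 64, kernel_size=7, stride=2, padding=3)
print(conv2d(clip.flatten(1, 2)).shape)    # torch.Size([1, 64, 56, 56])
```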

The first 3D CNN for AR is the C3D model [24]. Another successful implementation of 3D convolution is the Inception 3D model (I3D) [1], in which 3D convolution is applied in a much deeper fashion than in C3D. The ResNet version of 3D convolution is introduced in [11]. Later, the R(2+1)D [26] and S3D [33] architectures were introduced, in which 3D spatio-temporal convolutions are factorized into separate spatial and temporal convolutions and shown to be more effective than traditional 3D convolutions. Another important 3D CNN architecture is Channel-Separated Convolutional Networks (CSN) [25], which separates channel interactions from spatio-temporal interactions and can be thought of as the 3D CNN version of depth-wise separable convolution [13].

SlowFast networks [7] can be considered a joint implementation of fusion techniques and 3D CNN architectures. There are two streams, namely the fast and slow paths. The slow stream operates at a low frame rate and focuses on spatial information, like the RGB stream in traditional two-stream architectures, while the fast stream operates at a high frame rate and focuses on temporal information, like the optical flow stream in traditional two-stream architectures. There is information flow from the fast stream to the slow stream.

Although 3D CNNs are powerful, they still lack an effective temporal fusion strategy at the end of the architecture.

3 Proposed Method: BERT-based Temporal Modeling with 3D CNN for Activity Recognition

Figure 1: BERT-based Temporal Pooling

Bidirectional Encoder Representations from Transformers (BERT) [4] is a bidirectional self-attention method which has provided unprecedented success in many downstream Natural Language Processing (NLP) tasks. The bidirectional property enables BERT to fuse contextual information from both directions, instead of relying on a single direction as in earlier recurrent neural networks or other self-attention methods such as the Transformer [27]. Moreover, BERT introduces challenging unsupervised pre-training tasks which lead to useful representations for many tasks.

Our architecture utilizes BERT-based temporal pooling as shown in Fig. 1. In this architecture, the selected frames of the input sequence are propagated through a 3D CNN architecture without temporal global average pooling at the end. Then, in order to preserve the positional information, a learned positional encoding is added to the extracted features. To perform classification with BERT, an additional classification embedding x_cls is appended, as in [4] (represented as the red box in Fig. 1). Classification is carried out on the corresponding classification vector y_cls, which is fed to a fully connected layer that produces the predicted output label ŷ.
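
The following PyTorch sketch illustrates the overall flow of Fig. 1 under our reading of the text; a single standard transformer encoder layer stands in for the BERT layer, and all module names and sizes (e.g. feat_dim, seq_len) are illustrative rather than the released implementation:

```python
import torch
import torch.nn as nn

class BERTTemporalPooling(nn.Module):
    """Replace TGAP with a learned positional encoding, a classification token
    and a transformer encoder, as sketched in Fig. 1 (illustrative only)."""
    def __init__(self, feat_dim=512, seq_len=8, num_heads=8, num_classes=51):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))       # x_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len + 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, temporal_features):
        # temporal_features: (batch, seq_len, feat_dim), i.e. the spatially pooled
        # 3D CNN features *before* any temporal averaging.
        b = temporal_features.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, temporal_features], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.fc(x[:, 0])            # classify from the y_cls position

pool = BERTTemporalPooling()
logits = pool(torch.randn(2, 8, 512))      # e.g. 8 temporal positions of 512-d features
print(logits.shape)                        # torch.Size([2, 51])
```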

The general single-head self-attention model of BERT is formulated as

    y_i = PFFN( (1 / N(x)) * Σ_j f(x_i, x_j) g(x_j) ),    (1)

where the x values are the embedding vectors consisting of the extracted temporal visual information and its positional encoding; i indicates the index of the target output temporal position; j runs over all temporal positions; and N(x) is the normalization term. The function g(·) is the linear projection inside the self-attention mechanism of BERT, whereas f(x_i, x_j) denotes the similarity between x_i and x_j: f(x_i, x_j) = exp(θ(x_i)^T φ(x_j)), where θ(·) and φ(·) are also linear projections. The learnable functions g(·), θ(·) and φ(·) project the feature embedding vectors to a space in which the attention mechanism works more efficiently; their outputs are also called value, query and key, respectively [27]. PFFN is the position-wise feed-forward network applied to all positions separately and identically: PFFN(x) = W_2 GELU(W_1 x + b_1) + b_2, where GELU is the Gaussian Error Linear Unit activation function [12].
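
A direct, unoptimized translation of Eq. (1) into PyTorch might look as follows (single head, illustrative dimensions; g, theta, phi and pffn correspond to the projections and the PFFN defined above, and are freshly initialized here rather than trained):

```python
import torch
import torch.nn as nn

D = 512                                    # embedding dimension of the temporal features
g, theta, phi = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)
pffn = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

x = torch.randn(9, D)                      # N temporal embeddings (positions + cls token)

# f(x_i, x_j) = exp(theta(x_i)^T phi(x_j)); dividing by the normalization term N(x)
# is exactly a row-wise softmax over j.
attn = (theta(x) @ phi(x).t()).softmax(dim=-1)        # (N, N)

# y_i = PFFN( (1 / N(x)) * sum_j f(x_i, x_j) g(x_j) )
y = pffn(attn @ g(x))                                  # (N, D)
print(y.shape)
```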

The final classification decision is performed with one more linear layer which takes y_cls as input. The explicit form of y_cls can be written as

    y_cls = PFFN( (1 / N(x)) * Σ_j f(x_cls, x_j) g(x_j) ).    (2)

Therefore, our use of the temporal attention mechanism of BERT not only learns a convenient subspace in which the attention mechanism works efficiently, but also learns the classification embedding, i.e., how to properly attend to the temporal features of the 3D CNN architecture.

A similar idea for action recognition is implemented in non-local neural networks [32]. Non-local blocks use a similar attention concept, with 1x1x1 convolution filters realizing the g(·), θ(·) and φ(·) functions. The main difference between non-local attention and the proposed BERT attention is that the non-local block [32] is inserted not at the end of the architecture but at selected locations inside it. In contrast, our BERT-based temporal pooling operates on the extracted features of the 3D CNN architecture and utilizes multi-head attention to create multiple relations with the self-attention mechanism. Moreover, it utilizes positional encoding in order to preserve the order information.

4 Experiments

In this section, the datasets, implementation details, the ablation study, the results on different architectures, and the comparison with the state of the art are presented, respectively.

4.1 Dataset

Four datasets are relevant for our study: HMDB51 [15], UCF101 [23], Kinetics-400 [1] and IG65M [9]. HMDB51 consists of 7k clips with 51 classes, whereas UCF101 includes 13k clips with 101 classes. Both HMDB51 and UCF101 define three data splits, and performances are calculated by averaging the results over these three splits. Kinetics-400 consists of about 240k clips with 400 classes. IG65M is a weakly supervised dataset collected by using the Kinetics-400 [1] class names as hashtags on Instagram; it contains 65M clips from 400 classes. The dataset is not public for the time being, but the pre-trained models are available.

For analyzing the improvements of BERT on individual architectures (Section 4.4), split 1 of the HMDB51 dataset is used, whereas the comparisons with the state of the art (Section 4.5) are performed using all three splits of the HMDB51 and UCF101 datasets. Additionally, the ablation study (Section 4.3) is conducted on the three splits of HMDB51. Moreover, Kinetics-400 and IG65M are used to obtain pre-trained weights of the architectures before fine-tuning on HMDB51 and UCF101. The pre-trained weights are obtained from the authors of the respective architectures, namely ResNeXt, I3D, SlowFast and R(2+1)D. Among these architectures, R(2+1)D is pre-trained on IG65M, whereas the rest are pre-trained on Kinetics-400.

4.2 Implementation Details

For the standard architectures (with TGAP and without any modification to the architectures), SGD is utilized, except for the flow stream of I3D, whose learning rate is set empirically. For architectures with BERT, the ADAMW optimizer [18] is utilized, except for I3D, whose learning rate is again set empirically. For all training runs, the "reducing learning rate on plateau" scheduling is followed. The data normalization schemes are selected to conform with those used when pre-training the architectures, in order to benefit fully from the pre-trained weights. A multi-scale cropping scheme is applied for fine-tuning and testing of all architectures [31]. At test time, the scores of non-overlapping clips are averaged. The optical flow of the frames is extracted with the TV-L1 algorithm [34].
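
A hedged sketch of this training setup (ADAMW with decoupled weight decay plus reduce-on-plateau scheduling) is given below; the learning rate, weight decay, patience and factor values are placeholders, since the exact numbers are not reproduced in this text:

```python
import torch

model = torch.nn.Linear(512, 51)           # stand-in for a 3D CNN + BERT model

# ADAMW (decoupled weight decay) for the BERT-equipped architectures;
# the standard TGAP architectures are instead trained with SGD.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-2)

# "Reducing learning rate on plateau": shrink the LR when the validation metric stalls.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.1, patience=5)

for epoch in range(3):                     # toy loop; the real training step is omitted
    val_loss = 1.0 / (epoch + 1)           # placeholder validation loss
    scheduler.step(val_loss)
```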

In the BERT module, there are eight attention heads and one transformer block. The dropout ratio in PFFN is set to 0.9. The mask operation is applied with 0.2 probability; instead of using a mask token, the attention weight of the masked feature is set to zero. The learned positional embeddings are initialized from a zero-mean normal distribution with 0.02 standard deviation, the default Torch linear-layer initialization is used, and the classification token x_cls is initialized with all zeros. Differently, for the I3D-BERT architecture, the linear layers and the classification token x_cls of BERT are also initialized from a zero-mean normal distribution with 0.02 standard deviation, because this yields better results for I3D-BERT.

4.3 Ablation Study

We now analyze each step of our contribution and compare our method with alternative pooling strategies (see Table 1). For this analysis, the ResNeXt101 backbone is utilized with the RGB modality, a 112x112 input image size and 64-frame clips. In this table, the type of temporal pooling, the existence of Feature Reduction with Modified Block (FRMB), the type of optimizer, the top-1 performance, the number of parameters and the number of operations are presented as columns.

Type of Temporal Pooling | FRMB? | Optimizer | Top1 (%) | # of Params | # of Operations
Average Pooling (Baseline) | | SGD | 74.46 | 47.63 M | 38.56 GFlops
Average Pooling | | ADAMW | 75.99 | 47.63 M | 38.56 GFlops
Average Pooling | ✓ | ADAMW | 74.97 | 44.22 M | 38.36 GFlops
LSTM | ✓ | ADAMW | 74.18 | 47.58 M | 38.36 GFlops
Non-Local + Concatenation + Fully Connected Layer | ✓ | ADAMW | 76.36 | 47.35 M | 38.43 GFlops
Concatenation | ✓ | ADAMW | 76.49 | 44.30 M | 38.36 GFlops
Concatenation + Fully Connected Layer | ✓ | ADAMW | 76.84 | 47.45 M | 38.36 GFlops
BERT pooling (Ours) | ✓ | ADAMW | 77.49 | 47.38 M | 38.37 GFlops
Table 1: Ablation study of the RGB ResNeXt101 architecture for temporal pooling analysis on HMDB51. FRMB: Feature Reduction with Modified Block.

One important issue is the optimizer. For training BERT architectures in NLP tasks, the ADAM optimizer is typically chosen [4], whereas SGD is preferred for 3D CNN architectures [11, 1, 7, 26, 3]. For training BERT, we choose ADAMW rather than ADAM because ADAMW improves the generalization capability of ADAM [18]. In this ablation study, the ResNeXt101 architecture with average pooling is trained with both SGD and ADAMW (Table 1); ADAMW yields a 1.5% increase in performance compared to SGD.

In order to utilize the BERT architecture in a more parameter-efficient manner, the feature dimension at the output of the ResNeXt101 backbone is reduced from 2048 to 512. For this, two possible methods are considered: Feature Reduction with Modified Block (FRMB) and Feature Reduction with Additional Block (FRAB). In FRMB, the final bottleneck block of ResNeXt101 is replaced with a new bottleneck block that performs the feature dimension reduction. In FRAB, an additional bottleneck block is appended to the backbone to reduce the dimensionality. The two implementations are visualized in Figure 2. For this ablation study, FRMB is chosen over FRAB for two reasons. Firstly, FRMB yields about 0.5% better top-1 performance than FRAB. Secondly, FRMB has better computational complexity and parameter efficiency than FRAB, because FRAB introduces an additional block to the architecture. Overall, we choose FRMB owing to its lower computational complexity and better parameter efficiency, at the cost of a 1% decrease in top-1 performance compared to the standard backbone (Table 1).
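
A schematic way to express the two reduction options is sketched below; the plain 1x1x1 convolution stands in for the real ResNeXt bottleneck block (which uses group convolutions), and nn.Identity is a placeholder for the backbone, so this only contrasts where the 512-d block sits:

```python
import torch.nn as nn

def reduction_block(in_dim=2048, out_dim=512):
    # Stand-in for a bottleneck block whose output channel count is 512.
    return nn.Conv3d(in_dim, out_dim, kernel_size=1)

backbone_without_last_block = nn.Identity()   # placeholder: backbone truncated before its last block
full_backbone = nn.Identity()                 # placeholder: intact 2048-d backbone

# FRMB: the final bottleneck block is *replaced* by a 512-d block, so no depth is
# added, but the pre-trained weights of the replaced block are discarded.
frmb = nn.Sequential(backbone_without_last_block, reduction_block())

# FRAB: a 512-d block is *appended* after the intact backbone, keeping all
# pre-trained weights at the cost of extra parameters and computation.
frab = nn.Sequential(full_backbone, reduction_block())
```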

Figure 2: The implementations of Feature Reduction with Modified Block (FRMB) and Feature Reduction with Additional Block (FRAB): (a) Original, (b) FRMB, (c) FRAB.

For a fair comparison, we set the hyper-parameters of the other pooling strategies (LSTM, Non-Local + concatenation + fully connected layer, and concatenation + fully connected layer) such that their numbers of parameters and operations are roughly the same as those of the proposed BERT pooling. The LSTM is implemented with two stacks and a hidden size of 450. The inter-channel dimension of the Non-Local attention block (the dimension of its attention mechanism) is set equal to its input size, which is 512. The number of nodes of the fully connected layer is chosen to match the parameter count of the proposed BERT temporal pooling.

Analyzing Table 1, we observe that, among the five alternatives with FRMB, BERT is the best temporal pooling strategy. Additionally, our proposed FRMB-ResNeXt101-BERT provides 3% better top-1 accuracy than the ResNeXt101 average pooling baseline, despite having better computational complexity and parameter efficiency (see Table 1). The BERT layer itself has about 3M parameters and negligible computational complexity with respect to the ResNeXt101 backbone. Regarding the other temporal pooling strategies, LSTM worsens the performance with respect to temporal average pooling. Concatenation + fully connected layer is another successful strategy that utilizes the temporal features better than average pooling. Adding a Non-Local attention block before the concatenation + fully connected layer worsens the performance compared to concatenation + fully connected layer alone. It should be highlighted that the original Non-Local study [32] also prefers not to place Non-Local blocks within the final three bottleneck blocks, which is consistent with our experimental result for the Non-Local implementation.

4.4 Results on Different Architectures

In this part, the improvements brought by replacing TGAP with BERT pooling are presented for popular 3D convolution architectures for action recognition, including ResNeXt101 [11], I3D [1], SlowFast [7] and R(2+1)D [26].

4.4.1 ResNeXt Architecture

The ResNeXt architecture is essentially ResNet with group convolutions [11]. For this architecture, the input size is selected as 112x112 as in [3, 11], and a 64-frame clip length is utilized.

BERT | Modality | Top1 (%) | # Parameters | # Operations
 | RGB | 73.73 | 47.63 M | 38.56 GFlops
✓ | RGB | 77.25 | 47.38 M | 38.37 GFlops
 | Flow | 79.80 | 47.60 M | 34.16 GFlops
✓ | Flow | 81.76 | 47.36 M | 33.97 GFlops
 | Both | 82.35 | 95.23 M | 72.72 GFlops
✓ | Both | 83.99 | 94.74 M | 72.34 GFlops
Table 2: Analysis of the ResNeXt101 architecture with and without BERT for RGB, Flow and two-stream modalities on HMDB51 split-1.

The results of the ResNeXt101 architecture are given in Table 2. The performance is compared over the RGB modality, the (optical) Flow modality and Both (two-stream), in which both the RGB and Flow streams are utilized and the scores from the two streams are summed. The table also presents the number of parameters and operations of the architectures. The FRMB implementation is chosen over FRAB for this analysis (see Section 4.3 for details on FRAB and FRMB). Based on the results in Table 2, the most important conclusion is that BERT improves the performance over the standard architectures (without BERT) in all modalities.
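
The two-stream ("Both") rows are obtained by averaging per-clip scores within each stream and summing the two streams at test time; a minimal sketch of that fusion (all tensors and names are illustrative):

```python
import torch

num_clips, num_classes = 5, 51
rgb_scores = torch.randn(num_clips, num_classes).softmax(dim=-1)    # per-clip RGB scores
flow_scores = torch.randn(num_clips, num_classes).softmax(dim=-1)   # per-clip Flow scores

# Average the non-overlapping clip scores within each stream, then sum the streams.
video_score = rgb_scores.mean(dim=0) + flow_scores.mean(dim=0)
prediction = video_score.argmax().item()
print(prediction)
```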

4.4.2 I3D Architecture

The I3D architecture is an Inception-type architecture. In the I3D experiments, the input size is selected as 224x224 and a 64-frame length is used, conforming with the I3D study [1]. The results of the BERT experiments on the I3D architecture are given in Table 3. For the I3D-BERT architectures, the final feature dimension of the I3D backbone is reduced from 1024 to 512 in order to utilize BERT in a more parameter-efficient manner. However, contrary to the ResNeXt101-BERT architecture, FRAB is chosen instead of FRMB, because FRAB obtains about 3.6% better top-1 performance for the RGB-I3D-BERT architecture on split 1 of HMDB51 (see Section 4.3 for details on FRAB and FRMB). The reason behind the success of FRAB over FRMB might be that, with the modification in FRMB, the final Inception block of I3D cannot benefit from the weights pre-trained on the larger dataset.

The experimental results in Table 3 indicate that BERT increases the performance of the I3D architecture in all modalities. Moreover, BERT increases the top-1 performance of the two-stream I3D architecture by 2.09%, which is more than the 1.64% increase obtained for the two-stream ResNeXt101 architecture. However, the increase in the number of parameters with BERT is larger for I3D than for ResNeXt101 because of the FRAB implementation used instead of FRMB in the I3D-BERT architecture.

BERT | Modality | Top1 (%) | # Parameters | # Operations
 | RGB | 74.90 | 12.34 M | 111.33 GFlops
✓ | RGB | 75.75 | 16.40 M | 111.72 GFlops
 | Flow | 76.21 | 12.32 M | 102.52 GFlops
✓ | Flow | 77.25 | 16.37 M | 102.91 GFlops
 | Both | 80.59 | 24.66 M | 213.85 GFlops
✓ | Both | 82.68 | 32.77 M | 214.63 GFlops
Table 3: Performance analysis of the I3D architecture with and without BERT for RGB, Flow and two-stream modalities on HMDB51 split-1.

4.4.3 SlowFast Architecture

The SlowFast architecture [7] introduces a different perspective on two-stream architectures. Instead of utilizing two different modalities as two identical streams, the overall architecture includes two streams with different capabilities (namely the fast and slow streams, or paths) operating only on the RGB modality. In the SlowFast architecture, the slow stream has a better spatial capability, while the fast stream has a better temporal capability: the fast stream has higher temporal resolution and lower channel capacity than the slow stream. Although it might be possible to use the SlowFast architecture with the optical flow modality as well, the authors of SlowFast did not consider this in their study; therefore, our analysis of BERT on SlowFast also considers only the RGB modality.

The SlowFast architecture in our experiments is derived from a ResNet-50 architecture. The channel capacity of the fast stream is one eighth of that of the slow stream, and the temporal resolution of the fast stream is four times that of the slow stream. The input size is selected as 224x224 and a 64-frame length is utilized, conforming with the SlowFast study [7].

For the implementation of BERT on the SlowFast architecture, two alternative solutions are proposed: early-fusion BERT and late-fusion BERT. In early-fusion BERT, the temporal features are concatenated before the BERT layer and only a single BERT module is utilized; to make the concatenation feasible, the temporal resolution of the fast stream is decreased to that of the slow stream. In late-fusion BERT, two BERT modules are utilized, one for each stream, and the outputs of the two BERT modules are concatenated. Both variants are shown in Figure 3.

Figure 3: Early-fusion (a) and late-fusion (b) implementations of BERT on the SlowFast architecture.

In order to utilize the BERT architecture with fewer parameters, the final feature dimension of the SlowFast backbone is reduced, similar to the ResNeXt101-BERT and I3D-BERT architectures. As in the I3D-BERT architecture, FRAB is chosen instead of FRMB, because FRAB obtains about 1.5% better top-1 performance for the SlowFast-BERT architecture on split 1 of HMDB51 (see Section 4.3 for details on FRAB and FRMB). For early-fusion BERT, the feature dimension of the slow stream is reduced from 2048 to 512 and that of the fast stream from 256 to 128. For late-fusion BERT, only the feature dimension of the slow stream is reduced from 2048 to 512. The dimensions are also shown in Figure 3.
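
With the dimensions stated above, the two fusion variants can be sketched as follows; the bert_pool helper is a stand-in for the BERT-based temporal pooling of Section 3 (it simply takes the first encoder output instead of prepending a learned classification token), and everything apart from the quoted feature sizes is an assumption:

```python
import torch
import torch.nn as nn

T_slow, T_fast = 8, 32                        # fast stream has 4x the temporal resolution
slow = torch.randn(2, T_slow, 512)            # slow-stream features, reduced to 512-d
fast = torch.randn(2, T_fast, 128)            # fast-stream features, reduced to 128-d

def bert_pool(dim):
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=1)
    return lambda x: encoder(x)[:, 0]         # pooled vector per clip (cls-like position)

# Early fusion: subsample the fast stream to the slow temporal resolution,
# concatenate along channels, and run a single BERT module on 640-d features.
fast_sub = fast[:, ::4]                                         # (2, 8, 128)
early = bert_pool(512 + 128)(torch.cat([slow, fast_sub], dim=-1))

# Late fusion: one BERT module per stream, then concatenate the pooled outputs.
late = torch.cat([bert_pool(512)(slow), bert_pool(128)(fast)], dim=-1)
print(early.shape, late.shape)                # both: torch.Size([2, 640])
```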

BERT | Top1 (%) | # Parameters | # Operations
 | 78.37 | 33.76 M | 50.72 GFlops
✓ (early-fusion) | 79.54 | 43.17 M | 52.39 GFlops
✓ (late-fusion) | 80.78 | 42.04 M | 52.14 GFlops
Table 4: Performance analysis of the SlowFast architecture with and without BERT for the RGB modality on HMDB51 split-1.

The results of using BERT on the SlowFast architecture are given in Table 4. First of all, both BERT solutions perform better than the standard SlowFast architecture. From the parameter perspective, the implementation of BERT on SlowFast is not as efficient as on ResNeXt101 because of the FRAB implementation instead of FRMB, as was also the case for I3D-BERT. Moreover, the parameter increase of RGB-SlowFast-BERT is even higher than that of RGB-I3D-BERT because of the two-stream design of the SlowFast network for the RGB modality. The increase in the number of operations is also higher for SlowFast-BERT than for I3D-BERT and ResNeXt101-BERT, because of the higher temporal resolution of the SlowFast architecture and its two-stream design for the RGB modality.

Of the two proposed BERT solutions in Table 4, late-fusion BERT yields better performance at lower computational complexity than early-fusion BERT. Although the attention mechanism of early-fusion BERT is applied jointly on the concatenated features, the partial destruction of the temporal richness of the fast stream, caused by reducing its temporal resolution, might be the reason for its worse performance.

4.4.4 R(2+1)D Architecture

The R(2+1)D [26] architecture is a ResNet-type architecture consisting of separable 3D convolutions, in which temporal and spatial convolutions are implemented separately. For this architecture, 112x112 input dimensions are used following the original paper, and a 32-frame length is applied instead of 64 frames, because of the huge memory demand of this architecture and to be consistent with the paper [26]. The selected R(2+1)D architecture has 34 layers and is implemented with the basic block type instead of the bottleneck block type (for details about block types, see [11]). The most important difference of the R(2+1)D experiments from the previous architectures is the use of IG65M pre-trained weights instead of Kinetics pre-trained weights (see Section 4.1 for details); this should be kept in mind when comparing this architecture with the previous ones. The analysis of the R(2+1)D BERT architecture is limited to the RGB modality, since the IG65M study [9], in which the R(2+1)D architecture is preferred, is also limited to the RGB modality.
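
The "(2+1)D" factorization replaces a t x k x k 3D convolution with a 1 x k x k spatial convolution followed by a t x 1 x 1 temporal convolution; a minimal sketch is given below (the intermediate width of 96 is arbitrary here, whereas R(2+1)D chooses it so that the factorized pair matches the parameter count of the full 3D kernel):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 56, 56)      # (batch, channels, frames, height, width)

full_3d = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=(1, 1, 1))

factorized = nn.Sequential(
    nn.Conv3d(64, 96, kernel_size=(1, 3, 3), padding=(0, 1, 1)),   # spatial (2D) part
    nn.ReLU(inplace=True),                                         # extra nonlinearity in between
    nn.Conv3d(96, 128, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # temporal (1D) part
)

print(full_3d(x).shape, factorized(x).shape)   # both: torch.Size([1, 128, 16, 56, 56])
```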

BERT | Top1 (%) | # Parameters | # Operations
 | 81.76 | 63.67 M | 152.95 GFlops
✓ | 84.77 | 66.67 M | 152.97 GFlops
Table 5: Performance analysis of the R(2+1)D architecture with and without BERT for the RGB modality on HMDB51 split-1.

The experiments of BERT on the R(2+1)D architecture are presented in Table 5. The feature dimension of the R(2+1)D architecture is already 512, which is the same as the reduced feature dimension of the ResNeXt101 and I3D backbones in their BERT implementations; therefore, we do not use FRMB or FRAB for R(2+1)D. There is an increase of about 3M parameters, while the increase in the number of operations is negligible. The performance increase of BERT on the R(2+1)D architecture is about 3%, which is a significant increase for the RGB modality, as in the case of the ResNeXt101-BERT architecture.

4.5 Comparison with State-of-the-Art

In this section, the results of the best BERT architectures from the previous section are compared against the state-of-the-art methods. For this aim, two leading BERT architectures are selected: two-stream BERT ResNeXt101 and RGB BERT R(2+1)D (see Section 4.4). Note that these two architectures use different pre-training datasets, namely Kinetics-400 for ResNeXt101 and IG65M for R(2+1)D.

The results of the architectures on the HMDB51 and UCF101 datasets are presented in Table 6. The table indicates whether an architecture employs explicit optical flow and lists the pre-training dataset used by each method.

As shown in Table 6, BERT increases the top-1 performance of the two-stream ResNeXt101 by 1.77% and 0.41% on HMDB51 and UCF101, respectively. Additionally, BERT improves the top-1 performance of RGB R(2+1)D by 2.77% and 0.48% on HMDB51 and UCF101, respectively. The results obtained by the R(2+1)D BERT architecture are, to the best of our knowledge, the current state-of-the-art in AR. Among the architectures pre-trained on Kinetics-400, the two-stream ResNeXt101 BERT is again the best on HMDB51 but the second best on UCF101. This might be owing to the fact that HMDB51 involves some actions that can be resolved only with temporal reasoning and therefore benefits more from BERT's capacity.

An important point to note from the table is the effect of pre-training with the IG65M dataset. RGB R(2+1)D (without Flow) pre-trained on IG65M obtains 6.72% and 1.37% better top-1 performance than the one pre-trained on Kinetics-400, indicating the importance of the pre-training dataset even when the samples are collected in a weakly supervised manner.

Model | Uses Flow? | Extra Training Data | HMDB51 | UCF101
IDT [28] | ✓ | - | 61.70 | -
Two-Stream [22] | ✓ | ImageNet | 59.40 | 88.00
Two-stream Fusion + IDT [8] | ✓ | ImageNet | 69.20 | 93.50
ActionVlad + IDT [10] | ✓ | ImageNet | 69.80 | 93.60
TSN [30] | ✓ | ImageNet | 71.00 | 94.90
RSTAN + IDT [6] | ✓ | ImageNet | 79.90 | 95.10
TSM [17] | | Kinetics-400 | 73.50 | 95.90
R(2+1)D [26] | | Kinetics-400 | 74.50 | 96.80
R(2+1)D [26] | ✓ | Kinetics-400 | 78.70 | 97.30
I3D [1] | ✓ | Kinetics-400 | 80.90 | 97.80
MARS + RGB + Flow [3] | ✓ | Kinetics-400 | 80.90 | 98.10
FcF [21] | | Kinetics-400 | 81.10 | -
ResNeXt101 | ✓ | Kinetics-400 | 81.78 | 97.46
EvaNet [20] | | Kinetics-400 | 82.30 | -
HAF+BoW/FV halluc [29] | | Kinetics-400 | 82.48 | -
ResNeXt101 BERT (Ours) | ✓ | Kinetics-400 | 83.55 | 97.87
R(2+1)D | | IG65M | 81.22 | 98.17
R(2+1)D BERT (Ours) | | IG65M | 83.99 | 98.65
Table 6: Comparison with the state-of-the-art.

5 Conclusions

This study combines two major components of the AR literature, namely late temporal modeling and 3D convolution. Although many pooling, fusion and recurrent modeling strategies have been applied to the features of 2D CNN architectures, we believe this is the first study that removes temporal global average pooling (TGAP) and better exploits the temporal information at the output of 3D CNN architectures. To utilize these temporal features, BERT, an attention-based mechanism that has proven its success over recurrent architectures in NLP tasks, is selected. The effectiveness of this idea is demonstrated on several popular 3D CNN architectures, namely ResNeXt, I3D, SlowFast and R(2+1)D. In addition, significant improvements over state-of-the-art techniques are obtained on the HMDB51 and UCF101 datasets.

The most important contribution of this study is the introduction of the late temporal modeling concept for 3D CNN architectures; although BERT already outperforms average pooling, concatenation pooling and standard LSTM pooling, this paves the way for even better late temporal pooling strategies as future work. Additionally, unsupervised pre-training schemes can be explored for BERT 3D CNN architectures, since the real benefits of the BERT architecture emerge with unsupervised techniques. Finally, the proposed method also has the potential to improve tasks similar to AR, such as temporal and spatial action localization and video captioning.

Acknowledgments

This work was supported by an Institutional Links grant under the Newton-Katip Celebi partnership, Grant No. 217M519 by the Scientific and Technological Research Council of Turkey (TUBITAK) and ID [352335596] by the British Council, UK. The numerical calculations reported in this paper were partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).

References

  • [1] J. Carreira and A. Zisserman (2017-11) Quo Vadis, action recognition? A new model and the kinetics dataset. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Vol. 2017-Janua, pp. 4724–4733. External Links: ISBN 9781538604571, Document Cited by: §1, §2.2, §4.1, §4.3, §4.4.2, §4.4, Table 6.
  • [2] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng (2018-07) Multi-fiber Networks for Video Recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11205 LNCS, pp. 364–380. External Links: ISBN 9783030012458, Document, ISSN 16113349 Cited by: §1.
  • [3] N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid (2019-06) MARS: Motion-augmented rgb stream for action recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2019-June, pp. 7874–7883. External Links: ISBN 9781728132938, Document, ISSN 10636919 Cited by: §4.3, §4.4.1, Table 6.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018-10) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. External Links: Link Cited by: §3, §3, §4.3.
  • [5] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell (2017-04) Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 677–691. External Links: Document, ISSN 01628828 Cited by: §2.1.
  • [6] W. Du, Y. Wang, and Y. Qiao (2018-03) Recurrent spatial-temporal attention network for action recognition in videos. IEEE Transactions on Image Processing 27 (3), pp. 1347–1360. External Links: Document, ISSN 10577149 Cited by: §2.1.
  • [7] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019-10) Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2019-October, pp. 6201–6210. External Links: ISBN 9781728148038, Document, ISSN 15505499 Cited by: §1, §2.2, §4.3, §4.4.3, §4.4.3, §4.4.
  • [8] C. Feichtenhofer, A. Pinz, and A. Zisserman (2016-12) Convolutional Two-Stream Network Fusion for Video Action Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2016-Decem, pp. 1933–1941. External Links: ISBN 9781467388504, Document, ISSN 10636919 Cited by: §2.1, Table 6.
  • [9] D. Ghadiyaram, M. Feiszli, D. Tran, X. Yan, H. Wang, and D. Mahajan (2019-05) Large-scale weakly-supervised pre-training for video action recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2019-June, pp. 12038–12047. External Links: Link Cited by: §4.1, §4.4.4.
  • [10] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell (2017) ActionVLAD: Learning spatio-temporal aggregation for action classification. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, External Links: ISBN 9781538604571, Document Cited by: §2.1, Table 6.
  • [11] K. Hara, H. Kataoka, and Y. Satoh (2018-12) Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6546–6555. External Links: ISBN 9781538664209, Document, ISSN 10636919 Cited by: §1, §2.2, §4.3, §4.4.1, §4.4.4, §4.4.
  • [12] D. Hendrycks and K. Gimpel (2016) Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR abs/1606.08415. External Links: Link, 1606.08415 Cited by: §3.
  • [13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017-04) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. External Links: Link Cited by: §2.2.
  • [14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, External Links: ISBN 9781479951178, Document, ISSN 10636919 Cited by: §2.1.
  • [15] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2556–2563. External Links: ISBN 9781457711015, Document Cited by: §4.1.
  • [16] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G.M. Snoek (2018-01) VideoLSTM convolves, attends and flows for action recognition. Computer Vision and Image Understanding 166, pp. 41–50. External Links: Document, ISSN 1090235X Cited by: §2.1.
  • [17] J. Lin, C. Gan, and S. Han (2018-11) TSM: Temporal Shift Module for Efficient Video Understanding. Proceedings of the IEEE International Conference on Computer Vision 2019-October, pp. 7082–7092. External Links: Link Cited by: Table 6.
  • [18] I. Loshchilov and F. Hutter (2017-11) Decoupled Weight Decay Regularization. 7th International Conference on Learning Representations, ICLR 2019. External Links: Link Cited by: §4.2, §4.3.
  • [19] J. Y. H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici (2015-10) Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 07-12-June, pp. 4694–4702. External Links: ISBN 9781467369640, Document, ISSN 10636919 Cited by: §2.1, §2.1.
  • [20] A. Piergiovanni, A. Angelova, A. Toshev, and M. S. Ryoo (2018-11) Evolving Space-Time Neural Architectures for Videos. Proceedings of the IEEE International Conference on Computer Vision 2019-October, pp. 1793–1802. External Links: Link Cited by: §1, Table 6.
  • [21] A. Piergiovanni and M. S. Ryoo (2018-10) Representation Flow for Action Recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2019-June, pp. 9937–9945. External Links: Link Cited by: Table 6.
  • [22] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, Vol. 1, pp. 568–576. External Links: ISSN 10495258 Cited by: Table 6.
  • [23] K. Soomro, A. R. Zamir, and M. Shah (2012-12) UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. External Links: Link Cited by: §4.1.
  • [24] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, External Links: ISBN 9781467383912, Document, ISSN 15505499 Cited by: §2.2.
  • [25] D. Tran, H. Wang, M. Feiszli, and L. Torresani (2019-04) Video classification with channel-separated convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2019-Octob, pp. 5551–5560. External Links: Link, ISBN 9781728148038, Document, ISSN 15505499 Cited by: §1, §2.2.
  • [26] D. Tran, H. Wang, L. Torresani, J. Ray, Y. Lecun, and M. Paluri (2018-12) A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6450–6459. External Links: ISBN 9781538664209, Document, ISSN 10636919 Cited by: §1, §2.2, §4.3, §4.4.4, §4.4, Table 6.
  • [27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Å. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 2017-Decem, pp. 5999–6009. External Links: ISSN 10495258 Cited by: §3, §3.
  • [28] H. Wang and C. Schmid (2013) Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, External Links: ISBN 9781479928392, Document, ISSN 1550-5499 Cited by: Table 6.
  • [29] L. Wang, P. Koniusz, and D. Q. Huynh (2019-06) Hallucinating IDT Descriptors and I3D Optical Flow Features for Action Recognition with CNNs. Proceedings of the IEEE International Conference on Computer Vision 2019-October, pp. 8697–8707. External Links: Link Cited by: Table 6.
  • [30] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2018) Temporal Segment Networks for Action Recognition in Videos. External Links: ISBN 1705.02953v1, Document, ISSN 01628828 Cited by: Table 6.
  • [31] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao (2015-07) Towards Good Practices for Very Deep Two-Stream ConvNets. External Links: Link Cited by: §4.2.
  • [32] X. Wang, R. Girshick, A. Gupta, and K. He (2018-12) Non-local Neural Networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. External Links: ISBN 9781538664209, Document, ISSN 10636919 Cited by: §3, §4.3.
  • [33] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11219 LNCS, pp. 318–335. External Links: ISBN 9783030012663, Document, ISSN 16113349 Cited by: §1, §2.2.
  • [34] C. Zach, T. Pock, and H. Bischof (2007) A duality based approach for realtime TV-L1 optical flow. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 4713 LNCS, pp. 214–223. External Links: ISBN 3540749330, Document, ISSN 03029743 Cited by: §4.2.