Video action recognition requires strong temporal reasoning. Temporal reasoning in deep networks can be implemented by 3D (space+time) convolutions, temporal (average) pooling, or recurrent layers that aggregate frame-level or spatio-temporal representations of an input sequence. The effectiveness of temporal reasoning therefore depends on the representation upon which the aggregation is performed.
While spatio-temporal features are more effective than frame-level features in video recognition tasks, they are also more costly to compute and, because they involve a large number of parameters, harder to train. Furthermore, even with a deep backbone, the receptive field of spatio-temporal features is confined to the small fixed-size frame window from which they were computed. Since more abstract representations benefit from larger temporal context, a gradual increase of the temporal extent of receptive fields in the higher layers of the feature extraction backbone is desirable instead.
A feasible solution to increase the temporal size of receptive fields is to let features from adjacent branches interact. Temporal Segment Network (TSN) showed that it is useful to apply the difference of frames as input to a CNN for video action recognition. Temporal Difference Network (TDN) extended this idea to compute the difference of frame-level features obtained from each layer in a CNN. Such a hardcoded approach assumes that there is a significant variation of information between adjacent frames. This is, however, not always guaranteed in practice. It is therefore desirable to let a network decide when to subtract the features from adjacent frames.
In this paper we present a hierarchical feature aggregation scheme that is lightweight and can be plugged into any deep architecture with a CNN backbone. In particular, when plugged into TSN, a performance gain of 24.2% is obtained on the Something-v1 dataset with an addition of only 1.14% parameters and 1.04% Floating Point Operations (FLOPs). The main idea, as illustrated in Fig. 0(a), is to transfer features between adjacent branches at each layer in a way that adjacent features interact as they develop into the higher level representation. The amount of feature transfer is controlled through a convolutional gate that decides to what extent a feature is subtracted from or added to the adjacent branch, as detailed in Fig. 0(b). Through end-to-end training, the network learns to route features through the hierarchy, thereby causing the temporal extent of receptive fields to grow with depth and adapt to the input. We evaluate our scheme on a number of existing models, TSN, TRN and ECO, and show its flexibility and effectiveness in improving action recognition performance.
II Related Work
Many deep learning based techniques have been developed for video action recognition. The most effective and simple extension was developed by Simonyan and Zisserman. Their method consists of two different CNNs trained on a single RGB image frame and a stack of optical flow images, followed by a late fusion of the prediction scores. The image stream encodes the appearance information while the optical flow stream encodes the motion information. Several works followed this approach, seeking a suitable fusion of the two streams and exploring residual connections between them. The downside of this approach is its reliance on externally computed optical flow, which is computationally intensive.
In order to address the aforementioned problem, researchers have explored techniques for extracting spatio-temporal features from the RGB frames themselves. Karpathy et al. examined several fusion approaches at various levels of the CNN hierarchy. They found that a slow fusion approach, where the features from adjacent frames are fused at multiple hierarchical levels of the CNN, results in the best performance. Their fusion approach stacks the convolutional features from adjacent frames and performs temporal convolutions for extracting the temporal features. Later, Tran et al. developed 3DCNN with 3D convolutional layers so that spatio-temporal features can be extracted from a set of multiple RGB frames. Their approach showed that 3D convolutions are capable of extracting spatio-temporal features from small video segments consisting of a small number of frames. Several approaches have later been proposed for exploiting the parameters learned by a 2DCNN for image recognition [11, 12, 13]. Carreira and Zisserman developed a 3DCNN by inflating the 2D filters to 3D. Qiu et al. and Tran et al. proposed to factorize the 3D convolutions into a 2D convolution for spatial encoding followed by a 1D convolution for temporal encoding. Wang et al. developed an architecture based on 3DCNNs which decouples the spatial and temporal encoding. Their approach performs a linear encoding on individual frame features and a multiplicative encoding between a set of features from multiple frames. Such approaches using 3DCNNs have two major drawbacks: first, the number of trainable parameters increases massively, and second, they extract spatio-temporal features from just a small sample of adjacent frames. The first problem makes such approaches difficult to train on smaller datasets and the second makes them incapable of extracting long-range spatio-temporal features from videos.
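To make the first drawback concrete, the following sketch counts the weights of a single convolution layer in its 2D, 3D, and factorized (2D spatial followed by 1D temporal) forms. The function names, the channel width of 64, the 3-wide kernels, and the choice of an intermediate width equal to the output width in the factorized variant are illustrative assumptions, not values taken from the works discussed above.

```python
def conv2d_params(c_in, c_out, k=3):
    """Weights (bias excluded) of a k x k 2D convolution."""
    return c_in * c_out * k * k


def conv3d_params(c_in, c_out, k=3, t=3):
    """Weights (bias excluded) of a t x k x k 3D convolution."""
    return c_in * c_out * t * k * k


def factorized_params(c_in, c_out, k=3, t=3, c_mid=None):
    """Weights of a 1 x k x k spatial convolution followed by a
    t x 1 x 1 temporal convolution. c_mid is the width between the
    two convolutions (assumed equal to c_out when not given)."""
    c_mid = c_out if c_mid is None else c_mid
    return c_in * c_mid * k * k + c_mid * c_out * t


if __name__ == "__main__":
    c = 64
    print(conv2d_params(c, c))      # 36864
    print(conv3d_params(c, c))      # 110592
    print(factorized_params(c, c))  # 49152
```

Under these assumptions the 3D layer carries three times the weights of its 2D counterpart, while the factorized layer adds only a third on top of the 2D layer.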
In order to extract long-range spatio-temporal features, several techniques that perform sparse sampling of the video frames, as opposed to the dense sampling used in 3DCNNs, have been proposed. Such approaches use a 2DCNN for extracting the frame-level features followed by a late fusion using Recurrent Neural Networks [14, 15], pooling techniques such as average pooling or max pooling, Fully Connected (FC) layers, or 3D convolutions, among others. Such late fusion based approaches ignore the information extracted at the different layers of a CNN. Several approaches have been proposed to address this drawback by fusing the features from consecutive frames at different layers of the hierarchy of a CNN [2, 18, 19]. Ng and Davis propose to use an extra CNN that accepts the difference of feature maps from adjacent frames at different layers of a separate CNN. The final prediction is done by average pooling of the scores obtained from the two networks. Sun et al. improve this approach by applying a Sobel filter on the features in addition to the temporal differencing operation. Lee et al. improve this method by forwarding the spatial and temporal features across a single network. This is achieved by applying a set of fixed filters followed by differencing of the features from adjacent frames. Lin et al. propose a plug-and-play module that shifts the channels in the feature tensor across the temporal dimension for transferring information across frames.
III HF-Net: Hierarchical Feature Aggregation Networks
In this section we describe our hierarchical feature aggregation for video representation learning architectures that utilize a CNN backbone for frame-level or spatio-temporal feature extraction. We then present details of our action recognition model used in the experiments.
III-A Hierarchical Feature Aggregation
Deep architectures with a CNN backbone by design do not account for correlations that may exist between frame-level or spatio-temporal features at the earlier layers of the CNN. Indeed, in a CNN backbone a feature $x_t^l$ for input $t$ at layer $l$ is computed from $x_t^{l-1}$ independently of the features from inputs other than $t$, that is, $x_t^l = \mathrm{block}(x_t^{l-1})$, where block represents a set of convolutional layers with non-linearities. This may limit performance by design when the input sequence exhibits strong temporal dependence, such as video frames in action recognition. In such a setting, a feature aggregation in the early layers may better capture the temporal dependencies in the data sequence.
Our hierarchical feature aggregation is based on the assumption that adjacent features will benefit from interacting in order to produce a more abstract representation at each layer of the architecture. We want this interaction to be pairwise and feed-forward computable, that is, our backbone hierarchical aggregation scheme is realized with a re-designed block, denoted block2.
In order to define block2 we consider two elements that are pervasive in state-of-the-art architectures. First, most architectures perform feature pooling at the higher level, most often average pooling. Second, many networks follow a two-stream structure where optical flow is late fused from a separate branch. Some works [1, 2] already reported performance improvements even with frame differencing as a simplified version of optical flow. In our design of block2 both feature differencing and feature averaging are considered. We want the network to select or interpolate between these two modes, and maximise flexibility by learning to decide so locally in the feature tensor. This way, we provide the network with the capability to selectively route features through the hierarchy and build up discriminative representations integrating temporal context. As a result, temporal receptive fields grow as we go up in the hierarchy.
Given the backbone features $\tilde{x}_t = \mathrm{block}(x_t^{l-1})$, the output of block2 is

$$x_t^l = \tilde{x}_t + z_t \odot \tilde{x}_{t+1} - z_{t-1} \odot \tilde{x}_t,$$

where '$\odot$' represents the Hadamard product and $z_t$ acts as a gating tensor that chooses between averaging and differencing operations. Fig. 0(b) illustrates block2. Note that we subtract a gated $\tilde{x}_t$ from the backbone feature but we add it when computing $x_{t-1}^l$. Thus, the total feature flow on the sequence is preserved. Gating is generated by performing a 3D convolution on the stacked features followed by a $\tanh$ non-linearity:

$$z_t = \tanh\left(W * [\tilde{x}_t, \tilde{x}_{t+1}] + b\right),$$

where '$*$' represents the convolution operation. Since the range of the $\tanh$ non-linearity is [-1, 1], the network is capable of selecting from both averaging and differencing operations. $W$ and $b$ represent the weights and the bias of the 3D convolution. We use a kernel with output channel in the 3D convolution.
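The flow-preservation property noted above can be checked by summing the update over the sequence. The following is a sketch under assumed notation (the symbol names are not taken from the source): $\tilde{x}_t$ is the backbone feature of frame $t$, $z_t$ the gate between frames $t$ and $t+1$, $T$ the sequence length, and zero padding at the boundaries gives $\tilde{x}_{T+1} = 0$ and $z_0 = 0$.

```latex
\begin{aligned}
\sum_{t=1}^{T} x_t^{l}
  &= \sum_{t=1}^{T} \tilde{x}_t
   + \sum_{t=1}^{T} z_t \odot \tilde{x}_{t+1}
   - \sum_{t=1}^{T} z_{t-1} \odot \tilde{x}_t \\
  &= \sum_{t=1}^{T} \tilde{x}_t
   + \sum_{t=2}^{T} z_{t-1} \odot \tilde{x}_t
   - \sum_{t=2}^{T} z_{t-1} \odot \tilde{x}_t
   = \sum_{t=1}^{T} \tilde{x}_t .
\end{aligned}
```

The two gated sums cancel after re-indexing, so whatever one frame gives up is received by its neighbour, regardless of the gate values.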
III-B Layer Implementation
The proposed hierarchical aggregation across each layer can be realized in parallel, thereby allowing fast training and inference. Assuming that sequences of frames are applied as a batch to the network, the parallel implementation of hierarchical aggregation is realized as follows.
At layer $l$ of the backbone, we first feed the input sequence $X^{l-1}$ through the corresponding convolutional block of the backbone to obtain a new sequence $\tilde{X}$. We then left-shift and zero-pad $\tilde{X}$ to obtain $\tilde{X}_{\leftarrow}$ and compute the gating tensor $Z$ through a 3D convolution on the stacked $[\tilde{X}, \tilde{X}_{\leftarrow}]$. This is followed by $\tanh$ to map the resulting tensor to values in the range [-1, 1]. In order to obtain the sequence of output features, the Hadamard product of the gating tensor and the left-shifted feature tensor is added to the input feature tensor, while the Hadamard product of the right-shifted gating tensor and the input feature tensor is subtracted.
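The shift-based procedure described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `hf_aggregate` is hypothetical, the gate uses a hypothetical 2 x 1 x 1 (time x height x width) kernel with a single output channel because the actual kernel size is not given here, and tanh is assumed as the non-linearity with range [-1, 1].

```python
import numpy as np


def hf_aggregate(x, w, b):
    """Gated pairwise aggregation over a sequence of backbone features.

    x: array of shape (T, C, H, W), backbone block output for T frames.
    w: array of shape (2, C), assumed gate weights (one weight per time
       offset and input channel, i.e. a 2 x 1 x 1 kernel, one output channel).
    b: float, gate bias.
    """
    # Left shift with zero padding: x_next[t] = x[t + 1], last step is zero.
    x_next = np.zeros_like(x)
    x_next[:-1] = x[1:]

    # Gate from the stacked pair (x[t], x[t + 1]), squashed by tanh.
    pre = (np.einsum("tchw,c->thw", x, w[0])
           + np.einsum("tchw,c->thw", x_next, w[1]) + b)
    z = np.tanh(pre)[:, None, :, :]  # (T, 1, H, W), broadcast over channels

    # Right shift of the gate with zero padding: z_prev[t] = z[t - 1].
    z_prev = np.zeros_like(z)
    z_prev[1:] = z[:-1]

    # Add the gated neighbour feature, subtract the gated own feature.
    return x + z * x_next - z_prev * x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 3, 2, 2))
    y = hf_aggregate(x, rng.normal(size=(2, 3)), 0.1)
    # The sum over time is unchanged: feature flow is conserved.
    print(np.allclose(y.sum(axis=0), x.sum(axis=0)))
```

All operations act on the whole sequence at once, so no loop over frames is needed; a positive gate value mixes the next frame's feature into the current one, a negative value subtracts it.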
III-C Hierarchical Feature Aggregation Networks
A Hierarchical Feature Aggregation Network (HF-Net) is obtained by adding the Hierarchical Feature Aggregation module, presented in Sec. III-A, at various levels of the backbone CNN used in the network. Our approach is generalizable to any CNN architecture (e.g., HF-TSN). In the experiments, we use Inception with Batch Normalization (BN) as our backbone CNN. We plug in our Hierarchical Feature Aggregation module after each Inception block in the CNN. At each iteration, a set of sparsely sampled frames from a video is passed to the network. In addition to the common feed-forward flow of information in conventional CNNs, the hierarchical feature aggregation modules enable a horizontal flow of information across the features at different levels of the CNN hierarchy. The output features corresponding to all the frames from the final layer of the CNN can then be pooled together for encoding the long-range temporal features. Our architecture is highly flexible and can be combined with a number of temporal pooling techniques such as TSN, Temporal Relation Network (TRN), Gated Recurrent Unit (GRU), or 3DCNN. In Sec. IV-C, we evaluate the performance of our proposal on a number of existing approaches and show its flexibility and effectiveness in improving action recognition performance.
IV Experiments and Results
We evaluate the proposed hierarchical feature aggregation strategy on a number of existing models and standard action recognition datasets. We first briefly describe the considered datasets, evaluation metrics, and implementation details. Then, we perform an ablation analysis to quantitatively evaluate the impact of different HF properties and components. Finally, we compare the performance of various HF-Net architectures against state-of-the-art results on different public action recognition datasets. We show HF enhances recognition performance of all models where it is applied, with negligible additional complexity.
IV-A Datasets and Evaluation Protocol
We evaluate our proposed hierarchical feature aggregation technique on three standard action recognition benchmarks: Something-v1, EPIC-KITCHENS and HMDB51. Something-v1 consists of 86K training and 11K validation videos from 174 action classes. We report the performance on the validation set. The EPIC-KITCHENS dataset comprises egocentric videos with fine-grained activity labels. The labels are provided as verbs and nouns, and the dataset consists of 24K videos in the training set and 10K videos in the test set. We report the performance obtained on the test set. The HMDB51 dataset consists of videos collected from YouTube and contains around 6000 video clips from 51 action categories. The dataset is provided with three standard train/test splits and the final recognition accuracy is reported as the average of the accuracy obtained on the three splits. Both the Something-v1 and EPIC-KITCHENS datasets consist of crowd-collected videos with actions involving objects. Something-v1 gives importance to the actions rather than the objects involved in the action, while EPIC-KITCHENS gives relevance to both actions and objects. HMDB51 is a smaller dataset with less complex action categories of shorter temporal span, which can be identified with simple appearance cues. Some sample frames from the datasets are shown in Fig. 2. In addition to the recognition accuracy, we also compare the complexity of the models in terms of number of parameters and Floating Point Operations (FLOPs).
IV-B Implementation Details
As explained in Sec. III-C, we choose BNInception as the backbone. The proposed HF module is added after each Inception block of the CNN. The entire network, including the BN layers, is trained for 60 epochs using the Stochastic Gradient Descent (SGD) optimization algorithm with a batch size of 32. The learning rate is initially set to 0.001 and is reduced by a factor of 0.1 after 25 and 40 epochs. Dropout at a rate of 0.5 is applied before the final classification layer to avoid overfitting. Random scaling and cropping is used as data augmentation. During inference, only the center crop of the frames is used. In all experiments, we use 16 frames as the input to the model.
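The step schedule described above can be written as a small function. This is a minimal sketch: the function name is hypothetical, and it assumes epochs are counted from 0 and that each reduction takes effect from epoch index 25 and 40 onward, which the text does not specify.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(25, 40), gamma=0.1):
    """Step schedule: scale base_lr by gamma once per milestone reached.

    epoch is assumed to be 0-indexed; a milestone m is treated as reached
    when epoch >= m (an assumption about when the drop is applied).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    for epoch in (0, 24, 25, 39, 40, 59):
        print(epoch, learning_rate(epoch))
```

Over the 60 training epochs this gives three plateaus: the initial rate, one tenth of it, and one hundredth of it.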
IV-C Ablation Analysis
In this section, we report the ablation analysis performed on the validation set of the Something-v1 dataset. We measure the performance improvement obtained by adding the proposed HF module to the CNN backbone of a standard action recognition technique. We choose TSN as the baseline since it is one of the standard techniques for video action recognition. TSN divides each video into a pre-defined number of segments and applies one frame from each of the segments as the input to the network. The average of the outputs from the individual frames is used for computing the prediction scores. Thus, TSN fails to encode the temporal relations between video frames and hence acts as a suitable baseline for showing the capability of the proposed hierarchical feature aggregation approach in extracting spatio-temporal features.
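The segment-based sampling and average consensus described above can be sketched as follows. The function names are hypothetical, and taking the centre frame of each segment when no random generator is given is an assumption made for a deterministic example; it is not necessarily how TSN samples at test time.

```python
import numpy as np


def segment_frame_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of num_segments equal-length segments.

    With rng=None the centre frame of each segment is returned; otherwise
    a frame is drawn uniformly at random inside each segment.
    """
    edges = np.linspace(0, num_frames, num_segments + 1)
    starts, ends = edges[:-1], edges[1:]
    if rng is None:
        idx = (starts + ends) / 2.0
    else:
        idx = rng.uniform(starts, ends)
    # Truncate to integer frame indices and stay inside the video.
    return np.minimum(idx.astype(int), num_frames - 1)


def average_consensus(frame_scores):
    """Average per-frame class scores of shape (num_segments, num_classes)."""
    return np.asarray(frame_scores, dtype=float).mean(axis=0)


if __name__ == "__main__":
    print(segment_frame_indices(160, 16))
    scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
    # Reversing the frame order leaves the consensus unchanged.
    print(np.allclose(average_consensus(scores), average_consensus(scores[::-1])))
```

Because the mean does not depend on the order of its inputs, the consensus is identical for any permutation of the sampled frames, which is why plain average pooling cannot capture temporal relations.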
Tab. I shows the results of the ablation study. We first evaluated the performance of the model when a single module is added after the final inception block of the backbone, which already improves over the baseline. A further gain is obtained by adding 5 modules, one after each of the final 5 inception blocks. Finally, we add 10 modules, one after each of the inception blocks, resulting in an improvement over the TSN baseline. We also evaluated the performance of the model when two independent 3D convolutional layers are used to compute the gating, thereby breaking the conservative flow of features. This slightly increases the number of parameters and the complexity, yet the performance of the network is reduced, indicating that feature flow conservation is useful for spatio-temporal feature extraction in HF-Nets.
In Fig. 2(a), we show the top 10 action classes that improved the most by adding the HF module to the backbone CNN of TSN. From the figure, it can be seen that the network has enhanced its ability to distinguish between action classes that are similar in appearance, such as Unfolding something and Dropping something next to something and Showing something next to something, etc. The t-SNE plots of the features from the final layer of the CNN corresponding to these 10 action classes are shown in Fig. 2(b) and 2(c). It can be seen that the features from HF-TSN show a lower intra-class and higher inter-class variability compared to those from TSN.
|Method||Backbone||Pre-training||Something-v1 (%)||HMDB51 (%)|
|ECO||BNInception+3D ResNet-18||ImageNet+Kinetics||41.4||68.2|
|ARTNet||3D ResNet-18||Kinetics||-||70.9|
|I3D||3D ResNet-50||ImageNet+Kinetics||41.6||74.8|
|C3D||3D ResNet-18||Sports-1M||-||62.1|
|R(2+1)D||3D ResNet-34||ImageNet+Kinetics||-||74.5|
|Non-local I3D||3D ResNet-50||ImageNet+Kinetics||44.4||-|
|Non-local I3D+GCN||3D ResNet-50+GCN||ImageNet+Kinetics||46.1||-|
|ECOLite||BNInception+3D ResNet-18||ImageNet+Kinetics||42.2||68.5|
IV-D State-of-the-Art Comparison
In order to compare methods under the same conditions, we compare only with those models using raw RGB frames as input. However, note that the proposed approach is extendable to optical flow images as well.
Something-v1 and HMDB51: Tab. II compares the proposed approach with state-of-the-art techniques on the Something-v1 and HMDB51 datasets. For the Something-v1 dataset, in addition to TSN, we also evaluate the performance of the proposed approach on TRN. In both methods, adding the proposed HF module to the backbone yields a large improvement in performance. From the table, one can see that the proposed approach results in comparable performance to other approaches that use a bigger backbone CNN such as ResNet-50 or I3D. It should also be noted that other methods which achieve superior performance use strong pre-training on the Kinetics dataset. Importantly, the HF augmented models achieve this improved performance at a fraction of the number of parameters and FLOPs compared to other methods, allowing for faster inference and a smaller memory footprint, e.g., making them suitable for deployment on mobile devices. Fig. 3(a) illustrates the accuracy vs. complexity of state-of-the-art techniques. From the figure, it can be seen that HF-Net results in a large boost in the recognition accuracy over the baseline models.
For HMDB51, we augment three baselines, TSN, TRN and ECOLite, with the HF module and observe a gain in recognition accuracy over all three baselines. As mentioned previously, HMDB51 consists of actions with shorter temporal duration. As a result, 3D CNNs that perform dense sampling of frames achieve the best performance on this dataset [13, 11], owing to the ability of 3D convolution layers to extract spatio-temporal features from a short temporal window. ECOLite consists of a smaller number of 3D convolution layers on top of a 2D CNN backbone and thus is a middle ground between TSN and 3D CNNs. Even though ECOLite uses a more powerful (3D convolution) consensus layer than TSN (average pooling) and TRN (fully connected), the spatio-temporal features provided by HF aggregation are found to be beneficial. From the accuracy vs. complexity comparison of state-of-the-art approaches shown in Fig. 3(b), one can see that HF-ECOLite performs on par with the bigger 3D CNN models.
|Setting||Method||Top-1 Acc. Verb (%)||Top-1 Acc. Noun (%)||Top-1 Acc. Action (%)||Top-5 Acc. Verb (%)||Top-5 Acc. Noun (%)||Top-5 Acc. Action (%)||Precision Verb (%)||Precision Noun (%)||Precision Action (%)||Recall Verb (%)||Recall Noun (%)||Recall Action (%)|
|S1||2SCNN (FUSION)||42.16||29.14||13.23||80.58||53.70||30.36||29.39||30.73||5.35||14.83||21.10||4.46|
|S1||TSN (RGB)||45.68||36.80||19.86||85.56||64.19||41.89||61.64||34.32||9.96||23.81||31.62||8.81|
|S1||TSN (FLOW)||42.75||17.40||9.02||79.52||39.43||21.92||21.42||13.75||2.33||15.58||9.51||2.06|
|S1||TSN (FUSION)||48.23||36.71||20.54||84.09||62.32||39.79||47.26||35.42||10.46||22.33||30.53||8.83|
|S2||2SCNN (FUSION)||36.16||18.03||7.31||71.97||38.41||19.49||18.11||15.31||2.86||10.52||12.55||2.69|
|S2||TSN (RGB)||34.89||21.82||10.11||74.56||45.34||25.33||19.48||14.67||4.77||11.22||17.24||5.67|
|S2||TSN (FLOW)||40.08||14.51||6.73||73.40||33.77||18.64||19.98||9.48||2.08||13.81||8.58||2.27|
|S2||TSN (FUSION)||39.4||22.70||10.89||74.29||45.72||25.26||22.54||15.33||5.60||13.06||17.52||5.81|
EPIC-KITCHENS: The model is trained for verb, noun and action classification. We also apply the action prediction score as a bias to the verb and noun classifiers. We report the scores on the test set obtained from the evaluation server. It can be seen that HF-TSN obtains gains for verb classification in both the S1 and S2 settings over the TSN (RGB) baseline. In fact, HF-TSN from RGB surpasses the performance of the baselines that use both RGB and optical flow, showing its capacity for extracting highly discriminative short- and long-term spatio-temporal features.
V Conclusion
We presented a hierarchical aggregation scheme for video understanding architectures with a CNN backbone that is lightweight and effective. In HF-Nets, adjacent feature branches interact through gated feature differencing and averaging as they compile higher level representations, thereby providing cheap spatio-temporal features for competitive performance. We plugged HF on top of different baseline models (TSN, TRN, ECO) and evaluated on three public action recognition datasets, obtaining consistent performance improvements. We improve the action recognition accuracy of TSN on video clips of complex human-object relationships by more than 24% while introducing only about 1% additional trainable parameters and 1% FLOPs of computation overhead in inference. Since the HF-Net scheme can be plugged into any deep video architecture with a CNN backbone, our future work includes the evaluation of additional baselines and two-stream solutions.
References
[1] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in Proc. ECCV, 2016.
[2] J. Ng and L. Davis, “Temporal difference networks for video action recognition,” in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
[3] R. Goyal, S. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al., “The ‘Something Something’ Video Database for Learning and Evaluating Visual Common Sense,” in Proc. ICCV, 2017.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
[5] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. CVPR, 2016.
[6] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Proc. NIPS, 2014.
[7] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proc. CVPR, 2016.
[8] C. Feichtenhofer, A. Pinz, and R. Wildes, “Spatiotemporal residual networks for video action recognition,” in Proc. NIPS, 2016.
[9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proc. CVPR, 2014.
[10] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proc. ICCV, 2015.
[11] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proc. CVPR, 2017.
[12] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3d residual networks,” in Proc. ICCV, 2017.
[13] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proc. CVPR, 2018.
[14] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in Proc. CVPR, 2015.
[15] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proc. CVPR, 2015.
[16] B. Zhou, A. Andonian, A. Oliva, and A. Torralba, “Temporal relational reasoning in videos,” in Proc. ECCV, 2018.
[17] M. Zolfaghari, K. Singh, and T. Brox, “ECO: Efficient Convolutional Network for Online Video Understanding,” in Proc. ECCV, pp. 695–712, 2018.
[18] S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang, “Optical flow guided feature: a fast and robust motion representation for video action recognition,” in Proc. CVPR, 2018.
[19] M. Lee, S. Lee, S. Son, G. Park, and N. Kwak, “Motion feature network: Fixed motion filter for action recognition,” in Proc. ECCV, 2018.
[20] J. Lin, C. Gan, and S. Han, “Temporal Shift Module for Efficient Video Understanding,” arXiv preprint arXiv:1811.08383, 2018.
[21] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proc. ICML, 2015.
[22] D. Dwibedi, P. Sermanet, and J. Tompson, “Temporal Reasoning in Videos using Convolutional Gated Recurrent Units,” in Proc. CVPR Workshops, 2018.
[23] D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray, “Scaling egocentric vision: The epic-kitchens dataset,” in Proc. ECCV, 2018.
[24] H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre, “HMDB51: A large video database for human motion recognition,” in High Performance Computing in Science and Engineering ’12, pp. 571–582, Springer, 2013.
[25] L. Wang, W. Li, W. Li, and L. Van Gool, “Appearance-and-relation networks for video classification,” in Proc. CVPR, 2018.
[26] K. Hara, H. Kataoka, and Y. Satoh, “Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?,” in Proc. CVPR, 2018.
[27] X. Wang and A. Gupta, “Videos as space-time region graphs,” in Proc. ECCV, 2018.
[28] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification,” in Proc. ECCV, 2018.