Technical Report: Temporal Aggregate Representations

This technical report extends our work presented in [9] with more experiments. In [9], we tackle long-term video understanding, which requires reasoning from current and past or future observations and raises several fundamental questions. How should temporal or sequential relationships be modelled? What temporal extent of information and context needs to be processed? At what temporal scale should they be derived? [9] addresses these questions with a flexible multi-granular temporal aggregation framework. In this report, we conduct further experiments with this framework on different tasks and a new dataset, EPIC-KITCHENS-100.



1 Introduction

In this work, we apply the temporal aggregates model presented in [sener2020temporal] to next action anticipation, action recognition, and activity recognition in long-range videos (see Fig. 2). We also test our method on the new EPIC-KITCHENS-100 dataset. The model is described in detail in [sener2020temporal], and we refer the reader to that paper for further details.

Figure 1: Model overview. In this example we use 3 scales for computing the “spanning past” snippet features and 2 starting points to compute the “recent past” snippet features, by max-pooling over the frame features in each snippet. Each recent snippet is coupled with all the spanning snippets in our Temporal Aggregation Block (TAB). An ensemble of TAB outputs is used for next action anticipation. Best viewed in color.

Figure 2: Our temporal aggregates model [sener2020temporal] is very flexible in that it can be utilized for different tasks easily.

An overview of the building blocks of our temporal aggregates framework can be found in Fig. 1. We split video streams into snippets of equal length and max-pool the frame features within the snippets. We then create ensembles of multi-scale feature representations that are aggregated bottom-up based on scaling and temporal extent. Based on different start and end frames and the number of snippets, we define two types of snippet features: “recent” features from recent observations and “spanning” features drawn from the long-term video. The recent snippets cover a couple of seconds (or up to a minute, depending on the temporal granularity) before the current time point, while spanning snippets refer to the long-term past and may last up to ten minutes. In Fig. 1 we use two starting points to compute the “recent past” snippet features, each represented with a given number of snippets, and three scales to compute the “spanning past” snippet features. Key to both types of representations is the ensemble of snippet features from multiple scales.
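The snippet construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `snippet_features`, the toy 2-D frame features, and the chosen start points and snippet counts are hypothetical and are not the values used in the report.

```python
from typing import List

Vector = List[float]


def snippet_features(frames: List[Vector], start: int, end: int,
                     num_snippets: int) -> List[Vector]:
    """Split frames[start:end] into near-equal snippets and max-pool each one."""
    span = frames[start:end]
    if num_snippets <= 0 or len(span) < num_snippets:
        raise ValueError("need at least one frame per snippet")
    pooled = []
    for k in range(num_snippets):
        lo = k * len(span) // num_snippets
        hi = (k + 1) * len(span) // num_snippets
        snippet = span[lo:hi]
        # Element-wise maximum over all frames in this snippet.
        pooled.append([max(dim) for dim in zip(*snippet)])
    return pooled


# Toy stream: 12 frames with 2-D features (illustrative only).
frames = [[float(t), float(-t)] for t in range(12)]

# "Spanning" features: the whole observed stream at three scales.
spanning = [snippet_features(frames, 0, 12, k) for k in (2, 3, 4)]

# "Recent" features: two starting points close to the current time.
recent = [snippet_features(frames, s, 12, 2) for s in (8, 6)]
```

Each entry of `spanning` and `recent` is one multi-snippet representation; the ensemble over scales and starting points is what the later aggregation stages consume.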

Our framework is built in a bottom-up manner, starting with the recent and spanning features, which are coupled with non-local blocks (NLB) within coupling blocks (CB). Non-local operations [wang2018non] are applied to capture relationships amongst the spanning snippets and between spanning and recent snippets. Two such NLBs are combined in a CB, which calculates attention-reweighted recent and spanning context representations. Each recent representation is coupled with all spanning representations via individual CBs, and their outputs are combined in a Temporal Aggregation Block (TAB). The outputs of different TABs are then chained together for the task of interest.
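To make the coupling idea concrete, the following is a heavily simplified sketch of two attention steps, one among the spanning snippets and one between recent and spanning snippets. It is not the model of [sener2020temporal]: the learned projections of a real non-local block are omitted, and the dot-product scaling, the residual sum, and the names `attend` and `coupling_block` are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Re-express each query as itself plus an attention-weighted sum of context."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product similarity between the query and every context vector.
        sims = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(sims)
        mixed = [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        # Residual connection: add the attended context back onto the query.
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def coupling_block(recent: List[Vector],
                   spanning: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Two attention steps: spanning with itself, then recent with spanning."""
    spanning_ctx = attend(spanning, spanning)
    recent_ctx = attend(recent, spanning_ctx)
    return recent_ctx, spanning_ctx


recent_ctx, spanning_ctx = coupling_block([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In this sketch one such block would be instantiated per pair of recent and spanning representations, with the outputs combined afterwards.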

2 Experiments

2.1 Implementation Details

We train our models using the Adam optimizer [kingma2014adam] with a batch size of 10 and a dropout rate of 0.3. We train for k epochs (k = 15 for anticipation and k = 25 for recognition) and periodically decrease the learning rate by a factor of 10. We use 512-D vectors for all non-classification linear layers.
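A step-wise decay of this kind can be sketched as follows. The base learning rate and the decay interval below are placeholders chosen only to make the example runnable; they are not the values used in this report, and the function name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(base_lr: float, epoch: int, step_size: int,
                  factor: float = 10.0) -> float:
    """Learning rate after dividing by `factor` once every `step_size` epochs."""
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    return base_lr / (factor ** (epoch // step_size))


BASE_LR = 1.0    # placeholder, not the report's learning rate
STEP_SIZE = 5    # placeholder, not the report's decay interval
NUM_EPOCHS = 15  # k = 15 is the anticipation setting stated above

schedule = [step_decay_lr(BASE_LR, epoch, STEP_SIZE) for epoch in range(NUM_EPOCHS)]
```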

2.2 Recognizing Long-range Complex Activities

To validate our model further on a new task, we experiment on classifying long-range complex activities. Since these videos include multiple actions and are several minutes long, it becomes more challenging to model their temporal structure compared to short-term single action videos, see Fig. 2 “activity recognition”. Recently, [hussein2019timeception] proposed a neural layer, “Timeception”, which uses multi-scale temporal-only convolutions for modelling minutes-long complex activity videos, such as “cooking a meal”. Placed on top of backbone CNNs, the permutation invariant convolution layer, PIC [hussein2020pic], also aims at modelling only the temporal dimension. PIC is invariant to temporal permutations as it models their correlations regardless of their order, which helps to handle different action orderings in videos. It also uses pairs of key-value kernels to learn the most representative visual signals in long and noisy videos.

Dataset # spanning scope (s)
Breakfast entire video 5
Table 1: Model parameters for activity recognition on Breakfast.

Method                                        Fine-tuning   Accuracy (%)
I3D                                           no            64.3
I3D + Timeception [hussein2019timeception]    no            69.3
I3D + ours                                    no            80.8
I3D                                           yes           80.6
I3D + Timeception [hussein2019timeception]    yes           86.9
I3D + PIC [hussein2020pic]                    yes           89.8

Table 2: Comparisons to methods developed for recognizing long-range complex activities, Timeception [hussein2019timeception] and PIC [hussein2020pic], on the Breakfast Actions dataset. Our method outperforms Timeception [hussein2019timeception] by a significant margin, showing the superiority of our method in modelling long-range activities.

We experiment on the Breakfast actions dataset [kuehne2014language], which contains 1712 videos of 10 complex activities such as “making coffee”. In our model, we divide videos into three partitions and use each partition as a recent snippet. We use the entire video for computing our spanning snippets. The model parameters are presented in Table 1.

We report our comparisons on Breakfast Actions in Table 2 using two types of I3D features: features from an I3D model trained on Kinetics only, and features from an I3D model fine-tuned on the Breakfast dataset. Without fine-tuning, our method outperforms Timeception [hussein2019timeception] by 11.4% and the I3D backbone by 16.5%. PIC [hussein2020pic] uses the fine-tuned I3D features on Breakfast and shows a 3.1% improvement over Timeception [hussein2019timeception]. Fine-tuning improves the accuracy of the I3D backbone by 16.3%, which suggests that there is room for improvement for our method using better feature representations.

Task # segments (in seconds (s)) spanning scope (s)
Anticipation 90K 6 2
Recognition 90K (, ) 5
Table 3: Dataset details and our respective model parameters for anticipation and recognition. s and e refer to the start and end times of the segments for action recognition.

                   Overall                 Unseen Participants     Tail Classes
Split  Modality    Verb    Noun    Act.    Verb    Noun    Act.    Verb    Noun    Act.
Val    RGB         24.22   29.76   13.02   27.04   22.95   12.21   16.23   22.93   10.41
Val    Flow        18.90   18.68    7.27   26.53   18.86    9.54   10.65   12.53    5.25
Val    Obj         20.45   27.64   10.45   24.17   24.71   11.45   12.55   19.31    7.36
Val    ROI         21.22   26.61   11.62   25.49   19.16   10.10   13.36   19.91    9.10
Val    Fusion      23.15   31.37   14.73   28.01   26.23   14.47   14.50   22.47   11.75
Test   Fusion      21.76   30.59   12.55   17.86   27.04   10.46   13.59   20.62    8.85

Table 4: Action anticipation results (class-mean top-5 recall) on EPIC-KITCHENS-100 validation and test sets. We report our results for RGB, Flow, Obj and ROI modalities and the late fusion of the predictions from all these modalities (Fusion).

                   Overall                                         Unseen Participants     Tail Classes
                   Top-1 Accuracy (%)      Top-5 Accuracy (%)      Top-1 Accuracy (%)      Top-1 Accuracy (%)
Split  Modality    Verb    Noun    Act.    Verb    Noun    Act.    Verb    Noun    Act.    Verb    Noun    Act.
Val    RGB         59.92   45.14   36.87   86.72   71.09   56.97   47.51   31.46   27.51   26.82   21.89   18.04
Val    Flow        62.81   37.51   32.84   87.70   63.22   53.17   53.80   32.86   28.26   27.23    8.00   12.72
Val    Obj         49.89   41.70   30.97   84.02   73.56   54.00   48.36   34.84   27.63   23.64   20.63   14.27
Val    ROI         57.75   43.16   35.48   86.36   70.87   56.65   47.32   34.65   26.76   26.42   20.16   16.94
Val    Fusion      66.00   53.35   45.26   89.39   80.40   66.93   56.62   44.60   38.12   30.57   25.26   22.42
Test   Fusion      62.68   51.66   42.65   86.28   73.85   64.05   56.25   46.33   36.06   24.99   18.00   15.92

Table 5: Action recognition results on EPIC-KITCHENS-100 validation and test sets. We report our results for modalities RGB, Flow, Obj and ROI and late fusion of the predictions from all these modalities (Fusion).

2.3 Experiments on EPIC-KITCHENS-100

EPIC-KITCHENS-100 [damen2020rescaling] is the recently released extension of EPIC-KITCHENS-55 [damen2018scaling]. It is the largest egocentric dataset, with 100 hours of recordings capturing participants’ daily kitchen activities with a head-mounted camera. There are around 90K pre-trimmed segments (Table 3) extracted from 700 long videos. Each segment is annotated with an action composed of a verb class and a noun class, e.g., “pour water”. There are 4,025 actions composed of 97 verbs and 300 nouns. The dataset provides RGB and optical flow images, as well as bounding boxes extracted by a hand-object detection framework [shan2020understanding].


The spanning scales, recent scale, recent starting points, and recent ending points are given in Table 3. In our work, we anticipate or recognize the action classes directly rather than anticipating or recognizing the verbs and nouns independently [damen2018scaling]; the direct approach has been shown to outperform the latter [furnari2018leveraging]. We use the training and validation sets provided by [damen2020rescaling] for selecting our model parameters.

2.3.1 Features

We use the appearance (RGB), motion (optical flow), and object-based features provided by [furnari2019rulstm] for reporting the baseline results on EPIC-KITCHENS-100. They independently train two CNNs using the TSN [wang2016temporal] framework on RGB and flow images for action recognition on EPIC-KITCHENS-100. [furnari2019rulstm] also train object detectors to recognize the 352 object classes of the EPIC-KITCHENS-100 dataset.

We additionally extract region of interest (ROI) features for the hand-object interaction regions in each frame, using the pre-trained TSN model (on RGB) provided by [furnari2019rulstm]. We use the interacting hand-object bounding boxes provided by [shan2020understanding] and consider the union of these boxes to be our ROI for each frame. Because they focus primarily on the interacting regions, the RGB features from this ROI help our model ignore background clutter, which adversely affects performance. The feature dimensions are 1024, 1024, 352, and 1024 for appearance, motion, object, and ROI features, respectively.
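The per-frame ROI can be illustrated with a small sketch. It assumes that the "union" of the hand and object boxes is taken as the tightest axis-aligned rectangle enclosing all of them, and that boxes are given as (x_min, y_min, x_max, y_max); both the box format and the helper name `union_roi` are assumptions, not details confirmed by the report.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def union_roi(boxes: List[Box]) -> Box:
    """Smallest axis-aligned box that encloses every input box."""
    if not boxes:
        raise ValueError("at least one box is required")
    x_min = min(b[0] for b in boxes)
    y_min = min(b[1] for b in boxes)
    x_max = max(b[2] for b in boxes)
    y_max = max(b[3] for b in boxes)
    return (x_min, y_min, x_max, y_max)


# Toy example: one hand box and one object box in pixel coordinates.
hand = (10.0, 20.0, 50.0, 60.0)
obj = (40.0, 10.0, 90.0, 55.0)
roi = union_roi([hand, obj])
```

The resulting rectangle would then be cropped from the frame before feature extraction.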

2.3.2 Anticipation on EPIC-KITCHENS-100

The anticipation task of EPIC-KITCHENS-100 requires anticipating a future action a fixed anticipation time before it starts. We train our model separately for each feature modality (appearance, motion, object, and ROI) with the parameters described in Table 3 and apply late fusion to the predictions from all these modalities by average voting to compute our final results.
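Late fusion by average voting can be sketched as below, assuming each modality-specific model outputs one score per action class for a given sample. The modality names, the toy scores, and the helper names `late_fusion` and `top_k` are illustrative only.

```python
from typing import Dict, List


def late_fusion(scores_per_modality: Dict[str, List[float]]) -> List[float]:
    """Average the per-class scores of all modalities (average voting)."""
    score_lists = list(scores_per_modality.values())
    if not score_lists:
        raise ValueError("at least one modality is required")
    num_classes = len(score_lists[0])
    if any(len(s) != num_classes for s in score_lists):
        raise ValueError("all modalities must score the same classes")
    return [sum(s[c] for s in score_lists) / len(score_lists)
            for c in range(num_classes)]


def top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring classes, best first."""
    return sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]


# Toy scores over three action classes from two modalities.
fused = late_fusion({"rgb": [0.1, 0.6, 0.3], "flow": [0.2, 0.2, 0.6]})
prediction = top_k(fused, 1)[0]
```

Here the RGB model alone would pick class 1, but the averaged scores favour class 2, which is the kind of disagreement fusion is meant to resolve.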

Table 4 summarizes our results (class-mean top-5 recall (%)) on the validation and hold-out test sets of EPIC-KITCHENS-100 for all (overall) participants, unseen participants, and tail classes. On the validation set, the ensemble of all modalities improves the action scores for the overall, unseen, and tail categories, while the RGB-only model scores higher on the overall verb metric and on the tail-class verb and noun metrics. Our submission on the challenge leaderboard is named “temporalAgg”; the test results were obtained through this submission.
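For reference, class-mean top-k recall can be computed as in the following sketch: recall is measured per ground-truth class and then averaged over classes, so frequent classes do not dominate. This is a minimal illustration that averages over the classes present in the labels; it is not the official challenge evaluation code, and the function name and toy data are assumptions.

```python
from collections import defaultdict
from typing import List


def class_mean_topk_recall(scores: List[List[float]], labels: List[int],
                           k: int = 5) -> float:
    """Mean over classes of the fraction of samples whose label is in the top-k."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        counts[label] += 1
        hits[label] += int(label in ranked[:k])
    recalls = [hits[c] / counts[c] for c in counts]
    return 100.0 * sum(recalls) / len(recalls)


# Toy data: four samples, three classes, two ground-truth classes.
toy_scores = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.00],
    [0.10, 0.70, 0.20],
    [0.30, 0.25, 0.45],
]
toy_labels = [0, 0, 1, 1]
recall_at_1 = class_mean_topk_recall(toy_scores, toy_labels, k=1)
recall_at_3 = class_mean_topk_recall(toy_scores, toy_labels, k=3)
```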

2.3.3 Recognition on EPIC-KITCHENS-100

For recognition, we classify pre-trimmed action segments. We train our model separately for each feature modality using the model parameters described in Table 3. During inference, similar to our anticipation, a late fusion of the predictions from modalities RGB, Flow, Obj, and ROI is used.

Following the EPIC-KITCHENS-100 protocol [damen2020rescaling], we report Top-1/5 accuracies on both the validation and test sets for all (overall) participants, unseen participants, and tail classes in Table 5. On the validation set, fusing all modalities improves every score in every evaluation category over the best single modality. Our submission on the challenge leaderboard is named “temporalAgg”; the test results were obtained through this submission.