In this work, we use the temporal aggregates model of [sener2020temporal] for next action anticipation and for action and activity recognition in long-range videos (see Fig. 2). We also test our method on the new EPIC-KITCHENS-100 dataset. The model is described in detail in [sener2020temporal], and we refer the reader to that paper for further details.
An overview of the building blocks of our temporal aggregates framework is given in Fig. 1. We split video streams into snippets of equal length and max-pool the frame features within each snippet. We then create ensembles of multi-scale feature representations that are aggregated bottom-up based on scaling and temporal extent. Based on different start and end frames and different numbers of snippets, we define two types of snippet features: “recent” features from recent observations and “spanning” features drawn from the long-term video. The recent snippets cover a couple of seconds (or up to a minute, depending on the temporal granularity) before the current time point, while spanning snippets refer to the long-term past and may last up to ten minutes. In Fig. 1 we use two starting points to compute the “recent past” snippet features, and three scales to compute the “spanning past” snippet features. Key to both types of representation is the ensemble of snippet features from multiple scales.
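As a concrete illustration, the snippet pooling described above can be sketched in a few lines of NumPy. The snippet counts and scopes below are toy values for illustration, not the paper's settings:

```python
import numpy as np

def snippet_features(frames, start, end, num_snippets):
    """Max-pool per-frame features over `num_snippets` equal-length snippets
    covering frames[start:end]; returns a (num_snippets, D) array."""
    chunks = np.array_split(frames[start:end], num_snippets, axis=0)
    return np.stack([c.max(axis=0) for c in chunks])

# Toy video: 12 frames of 4-D features.
frames = np.arange(48, dtype=float).reshape(12, 4)

# "Spanning" features over the whole video at two scales, and "recent"
# features over the last four frames -- illustrative scales, not the paper's.
spanning = [snippet_features(frames, 0, 12, k) for k in (2, 3)]
recent = snippet_features(frames, 8, 12, 2)
```

Each scale yields one fixed-size set of snippet features, and the ensemble over scales is what enters the aggregation stage.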
Our framework is built bottom-up, starting with the recent and spanning features, which are coupled via non-local blocks (NLB) within coupling blocks (CB). Non-local operations [wang2018non] capture relationships among the spanning snippets and between the spanning and recent snippets. Two such NLBs are combined in a coupling block, which computes attention-reweighted recent and spanning context representations. Each recent representation is coupled with all spanning representations via individual CBs, and their outputs are combined in a Temporal Aggregation Block (TAB). Outputs of different TABs are then chained together for the task of interest.
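A minimal sketch of this coupling idea, using plain dot-product attention in place of the learned non-local blocks (the real NLBs include learned linear projections, which are omitted here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def non_local_block(queries, context):
    """Simplified non-local operation [wang2018non]: attention over the
    context snippets plus a residual connection. The learned linear
    projections of the real block are omitted."""
    attn = softmax(queries @ context.T / np.sqrt(queries.shape[1]))
    return queries + attn @ context

def coupling_block(recent, spanning):
    """Two NLBs: spanning attends to itself, then recent attends to the
    attention-reweighted spanning context."""
    spanning_ctx = non_local_block(spanning, spanning)
    recent_ctx = non_local_block(recent, spanning_ctx)
    return recent_ctx, spanning_ctx

rng = np.random.default_rng(0)
recent, spanning = rng.normal(size=(2, 8)), rng.normal(size=(5, 8))
r_out, s_out = coupling_block(recent, spanning)
```

The output shapes match the inputs, so several such blocks can be stacked and their outputs pooled in a TAB.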
2.1 Implementation Details
We train our models using the Adam optimizer [kingma2014adam] with batch size 10 and dropout rate 0.3. We train for 15 epochs for the anticipation task and 25 epochs for recognition, and decrease the learning rate by a factor of 10 over the course of training. We use 512-D vectors for all non-classification linear layers.
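These hyper-parameters can be summarized in a small config helper. The base learning rate and the decay interval below are placeholder assumptions, since their exact values are not stated in this excerpt:

```python
def training_config(task):
    """Hyper-parameters from Sec. 2.1; `base_lr` and `decay_every` are
    placeholder assumptions, as their exact values are not stated here."""
    return {
        "optimizer": "Adam",
        "batch_size": 10,
        "dropout": 0.3,
        "epochs": 15 if task == "anticipation" else 25,
        "base_lr": 1e-4,    # placeholder value
        "decay_every": 10,  # placeholder value
    }

def lr_at_epoch(cfg, epoch):
    # Step decay: divide the learning rate by 10 every `decay_every` epochs.
    return cfg["base_lr"] * 0.1 ** (epoch // cfg["decay_every"])

cfg = training_config("anticipation")
```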
2.2 Recognizing Long-range Complex Activities
To further validate our model on a new task, we experiment with classifying long-range complex activities. Since these videos contain multiple actions and are several minutes long, modelling their temporal structure is more challenging than for short single-action videos; see Fig. 2, “activity recognition”. Recently, [hussein2019timeception] proposed a neural layer, “Timeception”, which uses multi-scale temporal-only convolutions for modelling minutes-long complex activity videos, such as “cooking a meal”. Placed on top of backbone CNNs, the permutation-invariant convolution layer PIC [hussein2020pic] also models only the temporal dimension. PIC is invariant to temporal permutations, as it models correlations between snippets regardless of their order, which helps handle different action orderings in videos. It also uses pairs of key-value kernels to learn the most representative visual signals in long and noisy videos.
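As a minimal illustration of the temporal permutation invariance that PIC is built around (PIC itself is considerably more involved), an order-invariant aggregate such as a temporal max is unchanged when the snippets are shuffled in time:

```python
import numpy as np

rng = np.random.default_rng(1)
snippets = rng.normal(size=(6, 4))       # 6 snippet features for one video
shuffled = snippets[rng.permutation(6)]  # same snippets, permuted in time

# A temporal max is permutation-invariant: the aggregate is identical
# no matter in which order the actions appear.
invariant_ok = np.allclose(snippets.max(axis=0), shuffled.max(axis=0))
```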
| Model | Fine-tuned I3D | Accuracy (%) |
|---|---|---|
| I3D + Timeception [hussein2019timeception] | no | 69.3 |
| I3D + ours | no | 80.8 |
| I3D + Timeception [hussein2019timeception] | yes | 86.9 |
| I3D + PIC [hussein2020pic] | yes | 89.8 |
We experiment on the Breakfast actions dataset [kuehne2014language], which contains 1712 videos of 10 complex activities such as “making coffee”. In our model, we divide videos into three partitions and use each partition as a recent snippet. We use the entire video for computing our spanning snippets. The model parameters are presented in Table 1.
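Under these choices, preparing one Breakfast video's inputs might look as follows; the spanning snippet count is an illustrative value, not the paper's setting:

```python
import numpy as np

def pool(chunks):
    return np.stack([c.max(axis=0) for c in chunks])

def breakfast_inputs(frames, n_spanning=10):
    """Three video partitions, each max-pooled into one 'recent' feature,
    plus spanning snippets over the whole video. `n_spanning` is an
    illustrative count, not the paper's setting."""
    recents = pool(np.array_split(frames, 3, axis=0))
    spanning = pool(np.array_split(frames, n_spanning, axis=0))
    return recents, spanning

frames = np.random.default_rng(2).normal(size=(300, 16))  # a minutes-long video
recents, spanning = breakfast_inputs(frames)
```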
We report comparisons on Breakfast actions in Table 2 using two types of I3D features: features from an I3D model trained on Kinetics only, and features from an I3D model fine-tuned on the Breakfast dataset. With Kinetics-only features, our method outperforms Timeception [hussein2019timeception] by 11.4% and the I3D backbone by 16.5%. [hussein2020pic] uses the I3D features fine-tuned on Breakfast and shows a 3.1% improvement over Timeception [hussein2019timeception]. Fine-tuning improves accuracy by 16.3%, which suggests there is room for our method to improve further with better feature representations.
Table 3: Model parameters per task: number of segments, recent scope (s), and spanning scope (s).
Tables 4 and 5: Results for Overall, Unseen Participants, and Tail Classes (Top-1/Top-5 metrics, %).
2.3 Experiments on EPIC-KITCHENS-100
EPIC-KITCHENS-100 [damen2020rescaling] is the recently released extension of EPIC-KITCHENS-55 [damen2018scaling]. It is the largest egocentric dataset, with 100 hours of recordings capturing participants' daily kitchen activities with a head-mounted camera. There are around 90K pre-trimmed segments extracted from 700 long videos. Each segment is annotated with an action composed of a verb and a noun class, e.g., “pour water”. There are 4,025 actions composed of 97 verbs and 300 nouns. The dataset provides RGB and optical-flow images, as well as bounding boxes extracted by a hand-object detection framework [shan2020understanding].
The spanning scales, recent scale, and recent starting and ending points are given in Table 3. In our work, we anticipate or recognize action classes directly rather than anticipating or recognizing verbs and nouns independently [damen2018scaling]; the former has been shown to outperform the latter [furnari2018leveraging]. We use the training and validation sets provided by [damen2020rescaling] to select our model parameters.
We use the appearance (RGB), motion (optical flow), and object-based features provided by [furnari2019rulstm] for reporting baseline results on EPIC-100. The authors independently train two CNNs with the TSN [wang2016temporal] framework on RGB and flow images for action recognition on EPIC-KITCHENS-100. [furnari2019rulstm] also trains object detectors to recognize 352 object classes.
We additionally extract region-of-interest (ROI) features for the hand-object interaction regions in each frame from the pre-trained TSN (RGB) model provided by [furnari2019rulstm]. We use the interacting hand-object bounding boxes provided by [shan2020understanding] and take the union of these boxes as the ROI for each frame. The RGB features from this ROI focus on the interacting regions and thus help our model ignore background clutter, which otherwise adversely affects performance. The feature dimensions are 1024, 1024, 352, and 1024 for appearance, motion, object, and ROI features, respectively.
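Computing the per-frame ROI as the union of the detected boxes is straightforward; the coordinates below are made up for illustration:

```python
def union_box(boxes):
    """Union of hand/object boxes (x1, y1, x2, y2) -> one ROI per frame."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    return (x1, y1, x2, y2)

# One frame's detections: a hand box and an object box (made-up coordinates).
hand, obj = (120, 200, 180, 260), (150, 180, 240, 250)
roi = union_box([hand, obj])
```

The resulting rectangle tightly encloses both boxes, so the cropped features cover the full hand-object interaction while excluding most of the background.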
2.3.2 Anticipation on EPIC-KITCHENS-100
The anticipation task of EPIC-KITCHENS-100 requires anticipating the future action 1 s before it starts. We train our model separately for each feature modality (appearance, motion, object, and ROI) with the parameters described in Table 3, and apply late fusion to the predictions from all modalities by average voting to compute our final results.
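The late-fusion step can be sketched as follows; the class scores are hypothetical, and applying a softmax before averaging is one reasonable reading of "average voting", not necessarily the paper's exact procedure:

```python
import numpy as np

def late_fuse(scores_per_modality):
    """Average voting over modalities: softmax each modality's class scores,
    then take the mean. Softmax-before-averaging is an assumption here."""
    probs = []
    for s in scores_per_modality:
        e = np.exp(s - s.max())
        probs.append(e / e.sum())
    return np.mean(probs, axis=0)

# Hypothetical class scores over 4 actions from RGB, Flow, Obj, and ROI models.
modalities = [np.array([2.0, 1.0, 0.1, 0.0]),
              np.array([1.5, 2.5, 0.0, 0.2]),
              np.array([2.2, 0.5, 0.3, 0.1]),
              np.array([1.8, 1.2, 0.2, 0.0])]
fused = late_fuse(modalities)
pred = int(np.argmax(fused))
```

Averaging normalized scores lets a modality that is confident on one class outvote noisier modalities without any additional learned fusion weights.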
Table 4 summarizes our results (class-mean Top-5 recall (%)) on the EPIC-KITCHENS-100 validation and hold-out test sets for all (overall) and unseen participants and tail classes. Overall, the ensemble of all modalities improves action scores for overall, unseen, and tail classes, while training our model solely on RGB performs better for verb and noun scores. Our submission on the challenge leaderboard (test results obtained by submission to https://competitions.codalab.org/competitions/25925#results) is named “temporalAgg”.
2.3.3 Recognition on EPIC-KITCHENS-100
For recognition, we classify pre-trimmed action segments. We train our model separately for each feature modality using the model parameters described in Table 3. During inference, as for anticipation, we apply late fusion to the predictions from the RGB, Flow, Obj, and ROI modalities.
Following the EPIC-KITCHENS-100 protocol [damen2020rescaling], we report Top-1/5 accuracies on both the validation and test sets for all (overall) and unseen participants and tail classes in Table 5. Fusing all modalities significantly improves scores across all evaluation categories. Our submission on the challenge leaderboard (test results obtained by submission to https://competitions.codalab.org/competitions/25923#results) is named “temporalAgg”.