Implementation of "EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition, ICCV, 2019" in PyTorch
We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities -- RGB, Flow and Audio -- and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previous works, modalities are fused before temporal aggregation, with shared modality and fusion weights over time. Our proposed architecture is trained end-to-end, outperforming individual modalities as well as late-fusion of modalities. We demonstrate the importance of audio in egocentric vision, on a per-class basis, for identifying actions as well as interacting objects. Our method achieves state-of-the-art results on both the seen and unseen test sets of the largest egocentric dataset: EPIC-Kitchens, on all metrics using the public leaderboard.
With the availability of multi-sensor wearable devices (e.g. GoPro, Google Glass, Microsoft Hololens, MagicLeap), egocentric audio-video recordings have become popular in many areas such as extreme sports, health monitoring, life logging, and home automation. As a result, there has been renewed interest from the computer vision community in collecting large-scale datasets [8, 35] as well as in developing new methods, or adapting existing ones, for the first-person point-of-view scenario [46, 32, 17, 21, 9, 44].
In this work, we explore audio as a prime modality to provide complementary information to visual modalities (appearance and motion) in egocentric action recognition. While audio has been explored in video understanding in general [3, 5, 2, 29, 28, 6, 27, 23, 34, 11] the egocentric domain in particular offers rich sounds resulting from the interactions between hands and objects, as well as the close proximity of the wearable microphone to the undergoing action. Audio is a prime discriminator for some actions (e.g. ‘wash’, ‘fry’) as well as objects within actions (e.g. ‘put plate’ vs ‘put bag’). At times, the temporal progression (or change) of sounds can separate visually ambiguous actions (e.g. ‘open tap’ vs ‘close tap’). Audio can also capture actions that are out of the wearable camera’s field of view, but audible (e.g. ‘eat’ can be heard but not seen). Conversely, other actions are sound-less (e.g. ‘wipe hands’) and the wearable sensor might capture irrelevant sounds, such as talking or music playing in the background. The opportunities and challenges of incorporating audio in egocentric action recognition allow us to explore new multi-sensory fusion approaches, particularly related to the potential temporal asynchrony between the action’s appearance and the discriminative audio signal – the main focus of our work.
While several multi-modal fusion architectures exist for action recognition, current approaches perform temporal aggregation within each modality before modalities are fused [42, 22] or embedded. Works that do fuse inputs before temporal aggregation, e.g. , do so with inputs synchronised across modalities. In Fig. 1, we show an example of ‘breaking an egg into a pan’ from the EPIC-Kitchens dataset. The distinct sound of cracking the egg, the motion of separating the egg and the change in appearance of the egg occur at different frames/temporal positions within the video. Approaches that fuse modalities with synchronised input would thus be limited in their ability to learn such actions. In this work, we explore fusing inputs within a Temporal Binding Window (TBW) (Fig. 1), allowing the model to train using asynchronous inputs from the various modalities. Evidence in neuroscience and behavioural sciences points to the presence of such a TBW in humans [30, 41]. The TBW offers a “range of temporal offsets within which an individual is able to perceptually bind inputs across sensory modalities”. This is triggered by the gap in the biophysical time to process different senses. Interestingly, the width of the TBW in humans is heavily task-dependent: shorter for simple stimuli such as flashes and beeps, and intermediate for complex stimuli such as a hammer hitting a nail.
Combining our explorations into audio for egocentric action recognition, and using a TBW for asynchronous modality fusion, our contributions are summarised as follows. First, an end-to-end trainable mid-level fusion Temporal Binding Network (TBN) is proposed (code at: http://github.com/ekazakos/temporal-binding-network). Second, we present the first audio-visual fusion attempt in egocentric action recognition. Third, we achieve state-of-the-art results on the EPIC-Kitchens public leaderboards on both seen and unseen test sets. Our results show (i) the efficacy of audio for egocentric action recognition, (ii) the advantage of mid-level fusion within a TBW over late fusion, and (iii) the robustness of our model to background or irrelevant sounds.
We divide the related works into three groups: works that fuse visual modalities (RGB and Flow) for action recognition (AR), works that fuse modalities for egocentric AR in particular, and finally works from the recent surge of interest in audio-visual correspondence and source separation.
Visual Fusion for AR: By observing the importance of spatial and temporal features for AR, two-stream (appearance and motion) fusion has become a standard technique [36, 10, 42]. Late fusion, first proposed by Simonyan and Zisserman , combines the streams’ independent predictions. Feichtenhofer  proposed mid-level fusion of the spatial and temporal streams, showing optimal results by combining the streams after the last convolutional layer. In , 3D convolution for spatial and motion streams was proposed, followed by late fusion of modalities. All these approaches do not model the temporal progression of actions, a problem addressed by . Temporal Segment Networks (TSN)  perform sparse temporal sampling followed by temporal aggregation (averaging) of softmax scores across samples. Each modality is trained independently, with late fusion of modalities by averaging their predictions. Follow-up works focus on pooling for temporal aggregation, still training modalities independently [45, 13]. Modality fusion before temporal aggregation was proposed in , where the appearance of the current frame is fused with 5 uniformly sampled motion frames, and vice versa, using two temporal models (LSTM). While their motivation is similar to ours, their approach focuses on using predefined asynchrony offsets between two modalities. In contrast, we relax this constraint and allow fusion from any random offset within a temporal window, which is more suitable for scaling up to many modalities.
Fusion in Egocentric AR: Late fusion of appearance and motion has been frequently used in egocentric AR [8, 38, 24, 40], as well as extended to additional streams aimed at capturing egocentric cues [21, 37, 38]. In , the spatial stream segments hands and detects objects. The streams are trained jointly with a triplet loss on objects, actions and activities, and fused through concatenation.  uses head motion features, hand masks, and saliency maps, which are stacked and fed to both a 2D and a 3D ConvNet, and combined by late fusion. All previous approaches have relied on small-scale egocentric datasets, and none utilised audio for egocentric AR.
Audio-Visual Learning: Over the last three years, significant attention has been paid in computer vision to an underutilised and readily available source of information existing in video: the audio stream [3, 5, 2, 29, 28, 6, 27, 23, 34, 11]. These fall in one of four categories: i) audio-visual representation learning [2, 5, 6, 23, 28, 29], ii) sound-source localisation [3, 28, 34], iii) audio-visual source separation [28, 11] and (iv) visual-question answering . These approaches attempt fusion [2, 28] or embedding into a common space [3, 6, 26]. Several works sample the two modalities with temporal shifts, for learning better synchronous representations [28, 16]. Others sample within a 1s temporal window, to learn a correspondence between the modalities, e.g. [2, 3]. Of these works, [28, 16] note this audio-visual representation learning could be used for AR, by pretraining on the self-supervised task and then fine-tuning for AR.
Fusion for AR using three modalities (appearance, motion and audio) has been explored in , employing late-fusion of predictions, and [19, 20] using attention to integrate local features into a global representation. Tested on UCF101,  shows audio to be the least informative modality for third person action recognition (16% accuracy for audio compared to 80% and 78% for spatial and motion). A similar conclusion was made for other third-person datasets (AVA  and Kinetics [19, 20]).
In this work, we show audio to be a competitive modality for egocentric AR on EPIC-Kitchens, achieving comparable performance to appearance. We also demonstrate that audio-visual modality fusion in egocentric videos improves the recognition performance of both the action and the accompanying object.
Our goal is to find the optimal way to fuse multiple modality inputs while modelling temporal progression through sampling. We first explain the general notion of temporal binding of multiple modalities in Sec 3.1, then detail our architecture in Sec 3.2.
Consider a sequence of samples from one modality in a video stream, where is the video’s duration and
is the modality’s framerate (or frequency of sampling). Input samples are first passed through unimodal feature extraction functions. To account for varying representation sizes and frame-rates, most multi-modal architectures apply pooling functions to each modality in the form of average pooling or other temporal pooling functions (e.g. maximum or VLAD ), before attempting multimodal fusion.
Given a pair of modalities and , the final class predictions for a video are hence obtained as follows:
where and are unimodal feature extraction functions, is a temporal aggregation function, is the multimodal fusion function and is the output label for the video. In such architectures (e.g. TSN ), modalities are temporally aggregated for a prediction before different modalities are fused; this is typically referred to as ‘late fusion’.
Conversely, multimodal fusion can be performed at each time step as in . One way to do this would be to synchronise modalities and perform a prediction at each time-step. For modalities with matching frame rates, synchronised multi-modal samples can be selected as , and fused according to the following equation:
where is a multimodal feature extractor that produces a representation for each time step , and then performs temporal aggregation over all time steps. When frame rates vary, and more importantly so do representation sizes, only approximate synchronisation can be attempted,
We refer to this approach as ‘synchronous fusion’ where synchronisation is achieved or approximated.
In this work, however, we propose fusing modalities within temporal windows. Here modalities are fused within a range of temporal offsets, with all offsets constrained to lie within a finite time window, which we henceforth refer to as a temporal binding window (TBW). Formally,
where is a multimodal feature extractor that combines inputs within a binding window of width . Interestingly, as the number of modalities increases, say from two to three modalities, the TBW representation allows fusion of modalities each with different temporal offsets, yet within the same binding window :
This formulation hence allows a large number of different inputs combinations to be fused. This is different from proposals that fuse inputs over predefined temporal differences (e.g. ). Sampling within a temporal window allows fusing modalities with various temporal shifts, up to the temporal window width . This: 1) enables straightforward scaling to multiple modalities with different frame rates, 2) allows training with a variety of temporal shifts, accommodating, say, different speeds of action performance and 3) provides a natural form of data augmentation.
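The display equations of this section are not present in this copy. As a hedged sketch built only from the prose definitions above, the three fusion schemes for a pair of modalities could be written as follows. All notation here is assumed and may differ from the published equations: $m_{ij}$ is the $j$-th sample of modality $i$, $r_i$ its sampling rate, $f_1, f_2$ the unimodal feature extractors, $f_{sync}$ and $f_{tbw}$ the multimodal feature extractors, $G$ the temporal aggregation function, $h$ the fusion and prediction function, $y$ the output label, and $b$ the half-width of the binding window.

```latex
\begin{align}
\text{late fusion:} \quad
  & y = h\big(G(f_1(m_1)),\; G(f_2(m_2))\big) \\
\text{synchronous fusion:} \quad
  & y = h\big(G(f_{sync}(m_{1j}, m_{2k}))\big),
    \quad k = \lceil j\, r_2 / r_1 \rceil \\
\text{TBW fusion:} \quad
  & y = h\big(G(f_{tbw}(m_{1j}, m_{2k}))\big),
    \quad k \in \big[\lceil j\, r_2 / r_1 - b \rceil,\; \lceil j\, r_2 / r_1 + b \rceil\big]
\end{align}
```

With matching frame rates ($r_1 = r_2$) the synchronous case reduces to $k = j$, and a third modality would add a sample $m_{3l}$ with $l$ drawn from the same window around $j$.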
With the basic concept of a TBW in place, we now describe our proposed audio-visual fusion model, TBN.
Our proposed TBN architecture is shown in Fig 2 (left). First, the action video is divided into segments of equal width. Within each segment, we select a random sample of the first modality . This ensures the temporal progression of the action is captured by sparse temporal sampling of this modality, as with previous works [42, 45], while random sampling within the segment offers further data for training. The sampled is then used as the centre of a TBW of width . The other modalities are selected randomly from within each TBW (Eq. 3.1). In total, the input to our architecture in both training and testing is samples from modalities.
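The sampling procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the released implementation: the function name `sample_tbw_indices`, the argument `half_width`, and the simplifying assumption that all modalities share one common frame index are ours.

```python
import random


def sample_tbw_indices(num_frames, num_segments, half_width, num_modalities, rng=random):
    """Return, for each segment, one frame index per modality.

    Assumption: every modality is indexed on the same frame grid
    (0 .. num_frames - 1); real modalities have different sampling rates.
    """
    samples = []
    seg_len = num_frames / num_segments
    for s in range(num_segments):
        # Bounds of segment s on the common frame grid.
        lo = int(s * seg_len)
        hi = max(lo, int((s + 1) * seg_len) - 1)
        # The first modality is sampled randomly inside the segment ...
        centre = rng.randint(lo, hi)
        # ... and becomes the centre of a window clipped to the video.
        w_lo = max(0, centre - half_width)
        w_hi = min(num_frames - 1, centre + half_width)
        # Remaining modalities are sampled randomly inside that window.
        others = [rng.randint(w_lo, w_hi) for _ in range(num_modalities - 1)]
        samples.append([centre] + others)
    return samples


if __name__ == "__main__":
    # Hypothetical values: 90 frames, 3 segments, window of +/- 10 frames.
    print(sample_tbw_indices(90, 3, 10, 3, random.Random(0)))
```

Because the window width is passed separately from the number of segments, neighbouring windows may overlap when `half_width` exceeds half the segment length.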
Within each of the TBWs, we argue that the complementary information in audio and vision can be better exploited by combining the internal representations of each modality before temporal aggregation, and hence we propose a mid-level fusion. A ConvNet (per modality) extracts mid-level features, which are then fused through concatenating
the modality features and feeding them to a fully-connected layer, making multi-modal predictions per TBW. We backpropagate all the way to the inputs of the ConvNets. Fig3 details the proposed TBN block. The predictions, for each of these unified multimodal representations, are then aggregated for video-level predictions. In the proposed architecture, we train all modalities simultaneously. The convolutional weights for each modality are shared over the segments. Additionally, mid-level fusion weights and class prediction weights are also shared across the segments.
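As a rough illustration of the data flow in a TBN block, the sketch below concatenates per-modality features inside each TBW, applies one shared fully-connected fusion layer and one shared classifier, and then aggregates the per-window predictions. It uses plain Python lists rather than PyTorch tensors, and the helper names (`linear`, `relu`, `tbn_forward`), the ReLU non-linearity and the use of averaging for aggregation are assumptions made for this example.

```python
def linear(x, weights, bias):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def tbn_forward(windows, fuse_w, fuse_b, cls_w, cls_b):
    """windows: list over TBWs; each entry is a list of per-modality feature vectors.

    The same fusion and classifier weights are applied to every window,
    mirroring the weight sharing across segments described in the text.
    """
    per_window_scores = []
    for feats in windows:
        # Mid-level fusion: concatenate the modality features ...
        concat = [v for f in feats for v in f]
        # ... and pass them through a shared fully-connected layer.
        fused = relu(linear(concat, fuse_w, fuse_b))
        per_window_scores.append(linear(fused, cls_w, cls_b))
    # Temporal aggregation (assumed here to be a simple average).
    n = len(per_window_scores)
    return [sum(col) / n for col in zip(*per_window_scores)]


if __name__ == "__main__":
    windows = [[[1.0], [2.0]], [[3.0], [4.0]]]  # 2 windows, 2 modalities, 1-D features
    print(tbn_forward(windows, [[1.0, 1.0]], [0.0], [[1.0], [-1.0]], [0.0, 0.0]))
```

The key point is the order of operations: fusion happens inside each window, and only the fused predictions are aggregated over time.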
To avoid biasing the fusion towards longer or shorter action lengths, we calculate the window width relative to the action video length. Our TBW is thus of variable width, where the width is a function of the length of the action. We note again that can be set independently of the number of segments , allowing the temporal windows to overlap. This is detailed in Sec. 4.1.
In Fig 2, we contrast the TBN architecture (left) to an extended version of the TSN architecture (right). The extension is to include the audio modality, since the original TSN only utilises appearance and motion streams. There are two key differences: first, in TSN each modality is temporally aggregated independently (across segments), and the modalities are only combined by late fusion (e.g. the RGB scores of each segment are temporally aggregated, and the flow scores of each segment are temporally aggregated, individually). Hence, it is not possible to benefit from combining modalities within a segment which is the case for TBN. Second, in TSN, each modality is trained independently first after which predictions are combined in inference. In the TBN model instead, all modalities are trained simultaneously, and their combination is also learnt.
Dataset: We evaluate the TBN architecture on the largest dataset in egocentric vision: EPIC-Kitchens , which contains action segments recorded by participants performing non-scripted daily activities in their native kitchen environments. In EPIC-Kitchens, an action is defined as a combination of a verb and a noun, e.g. ‘cut cheese’. There are in total verb classes and noun classes, though these are heavily imbalanced. The test set is divided into two splits: Seen Kitchens (S1), where sequences from the same environment appear in both training and testing, and Unseen Kitchens (S2), where the complete sequences for participants are held out for testing. Importantly, EPIC-Kitchens sequences have been captured using a head-mounted GoPro, with the audio released as part of the dataset. No previous baseline using audio is available for this dataset.
RGB and Flow: We use the RGB frames and the pre-computed optical flow that are publicly available with the dataset .
Audio Processing: We extract s of audio, convert it to single-channel, and resample it to 24kHz. We then convert it to a log-spectrogram representation using an STFT of window length ms, hop length ms and frequency bands. This results in a 2D spectrogram matrix of size , after which we compute the logarithm. Since many egocentric actions are very short (s), we extract s of audio from the untrimmed video, allowing the audio segment to extend beyond the action boundaries.
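To make the log-spectrogram step concrete, here is a small self-contained sketch using a naive DFT from the standard library. It is meant only to show the shape of the computation (framing, windowing, magnitude, logarithm). The Hann window, the `eps` offset inside the logarithm, and the toy signal parameters are assumptions of this example, not the settings used in the paper, and a real pipeline would use an optimised STFT routine.

```python
import cmath
import math


def log_spectrogram(signal, win_len, hop_len, eps=1e-6):
    """Return a list of frames, each a list of log-magnitudes per frequency bin."""
    n_bins = win_len // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop_len):
        frame = signal[start:start + win_len]
        # Hann window (an assumed choice) to reduce spectral leakage.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * n / (win_len - 1)))
            for n, x in enumerate(frame)
        ]
        row = []
        for k in range(n_bins):
            # Naive DFT coefficient for bin k.
            acc = sum(
                windowed[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                for n in range(win_len)
            )
            row.append(math.log(abs(acc) + eps))
        frames.append(row)
    return frames


if __name__ == "__main__":
    # Toy signal: 128 samples of a sine that falls exactly on bin 4 of a 32-sample window.
    sig = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
    spec = log_spectrogram(sig, win_len=32, hop_len=16)
    print(len(spec), len(spec[0]))
```

The output is a 2D matrix with one row per time frame and one column per frequency band, which is the form a 2D ConvNet can consume as if it were an image.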
We implement our model in PyTorch. We use Inception with Batch Normalisation (BN-Inception)  as a base architecture, and fuse the modalities after the average pooling layer. We chose BN-Inception as it offers a good compromise between performance and model size, which is critical for our proposed TBN: it trains all modalities simultaneously and is hence memory-intensive. In TSN, the three modalities have 10.78M, 10.4M and 10.4M parameters respectively, with only one modality in memory during training. In contrast, TBN has 32.64M parameters.
We train using SGD with momentum , a batch size of , a dropout of , a momentum of , and a learning rate of . Networks are trained for epochs, and the learning rate is decayed by a factor of 10 at epoch . We initialise the RGB and the Audio streams from ImageNet. For the Flow stream, we use stacks of 10 interleaved horizontal and vertical optical flow frames, and use the pre-trained Kinetics model, provided by the authors of .
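The single-step learning-rate decay mentioned above can be expressed as a small helper. The function name `step_lr` and the numbers in the usage line are placeholders chosen for illustration; they are not the hyperparameters of the paper.

```python
def step_lr(base_lr, epoch, milestone, factor=10.0):
    """Learning rate at a given epoch with one decay step.

    The rate is divided by `factor` from epoch `milestone` onwards.
    """
    return base_lr / factor if epoch >= milestone else base_lr


if __name__ == "__main__":
    # Hypothetical schedule: base rate 0.1, decayed at epoch 5.
    print([step_lr(0.1, e, milestone=5) for e in range(8)])
```

In PyTorch the same behaviour is typically obtained with a milestone-based scheduler wrapped around the SGD optimiser.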
Note that our network is trained end-to-end for all modalities and TBWs. We train with segments over the modalities, with , allowing the temporal window to be as large as the action segment. We test using evenly spaced samples for each modality, as with the TSN basecode for direct comparison.
This section is organised as follows. First, we show and discuss the performance of single modalities, and compare them with our proposed TBN, with a special focus on the efficacy of the audio stream. Second, we compare different mid-level fusion techniques. And finally, we investigate the effect of the TBW width on both training and testing.
||Top-1 Accuracy||Top-5 Accuracy||Avg Class Precision||Avg Class Recall|
Single-modality vs multimodal fusion performance: We examine the overall performance of each modality individually in Table 1. Although it is clear that RGB and optical flow are stronger modalities than audio, an interesting finding is that audio performs comparably to RGB on some of the metrics (e.g. top-1 verb accuracy), signifying the relevance of audio for recognising egocentric actions. As expected, optical flow outperforms RGB in S2, which matches the expectation that Flow is more invariant to the environment. Interestingly, for S1 the RGB and Flow modalities perform comparably, and in some cases RGB performs better.
To obtain a better analysis of how these modalities perform, we examine the accuracy of individual verb and noun classes on S1, using single modalities. Fig 4 plots top-performing verb and noun classes, into a Venn diagram. For each class, we consider the accuracy of individual modalities. If all modalities perform comparably (within 0.15), we plot that class in the intersection of the three circles. On the other hand, if one modality is clearly better than the others (more than 0.15), we plot the class in the outer part of the modality’s circle. For example, for the verb ‘close’, we have per-class accuracy of 0.23, 0.47 and 0.42 for RGB, Flow and Audio respectively. We thus note that this class performs best for two modalities: Flow and Audio, and plot it in the intersection of these two circles.
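The grouping rule used for the Venn diagram can be read as: a class belongs to every modality whose accuracy is within 0.15 of the best one. The helper below is one possible reading of that rule (the name `leading_modalities` is ours), checked against the ‘close’ example from the text.

```python
def leading_modalities(acc, margin=0.15):
    """Return the set of modalities whose accuracy is within `margin` of the best."""
    best = max(acc.values())
    return {name for name, value in acc.items() if best - value <= margin}


if __name__ == "__main__":
    # Per-class accuracies for the verb 'close', as quoted in the text.
    print(leading_modalities({"RGB": 0.23, "Flow": 0.47, "Audio": 0.42}))
```

For ‘close’ this selects Flow and Audio, so the class is placed in the intersection of those two circles; a class for which all three accuracies lie within the margin lands in the centre of the diagram.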
From this plot, many verb and noun classes perform comparably for all modalities (e.g. ‘wash’, ‘peel’ and ‘fridge’, ‘sponge’). This suggests all three modalities contain useful information for these tasks. A distinctive difference, however, is observed in the importance of individual modalities for verbs and nouns. Verb classes are strongly related to the temporal progression of actions, making Flow more important for verbs than nouns. Conversely, noun classes can be predicted with high accuracy using RGB alone. Audio, on the other hand, is important for both nouns and verbs, particularly for some verbs such as ‘turn-on’, and ‘spray’. For nouns, Audio tends to perform better for objects with distinctive sounds (e.g. ‘switch’, ‘extractor fan’) and materials that sound when manipulated (e.g. ‘foil’).
In Table 1, we compare single modality performance to the performance over the three modalities. Single modalities are trained as in TSN, as TBN is designed to bind multiple modalities. We find that the fusion method outperforms single modalities, and that audio is a significantly informative modality across the board. Per-class accuracies, for individual modalities as well as for TBN trained on all three modalities, can be seen in Figure 6. The advantage of the fusion method is more pronounced for verbs (where we expect motion and audio to be more informative) than for nouns, and more for particular noun classes than others, such as ‘pot’, ‘kettle’ and ‘microwave’, and for particular verb classes, e.g. ‘spray’ (fusion 0.54, RGB 0.09, Flow 0, Audio 0.3). This suggests that the mixture of complementary and redundant information captured in a video is highly dependent on the action itself, making the fusion method more useful for some classes than for others. We also note that the fusion method helps to significantly boost the performance of the tail classes (Fig. 6, right and table in Appendix C), where individual modality performance tends to suffer.
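|S1 (Seen Kitchens)|Top-1 Verb|Top-1 Noun|Top-1 Action|Top-5 Verb|Top-5 Noun|Top-5 Action|Avg Class Precision Verb|Avg Class Precision Noun|Avg Class Precision Action|Avg Class Recall Verb|Avg Class Recall Noun|Avg Class Recall Action|
|Context gating|63.77|44.33|33.47|90.04|69.09|54.10|57.31|42.20|21.72|45.63|41.53|20.20|
|Gating fusion|61.52|43.54|31.61|89.54|68.42|52.57|52.07|39.62|18.39|42.55|39.77|18.66|
|S2 (Unseen Kitchens)|Top-1 Verb|Top-1 Noun|Top-1 Action|Top-5 Verb|Top-5 Noun|Top-5 Action|Avg Class Precision Verb|Avg Class Precision Noun|Avg Class Precision Action|Avg Class Recall Verb|Avg Class Recall Noun|Avg Class Recall Action|
|Context gating|52.65|27.35|19.16|79.25|52.00|36.40|30.82|23.16|11.72|23.39|25.03|12.58|
|Gating fusion|50.16|27.25|18.41|78.80|50.84|34.04|28.42|22.42|12.34|23.92|24.15|13.14|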
Efficacy of audio: We train TBN only with the visual modalities (RGB+Flow) and the results can be seen in Table 1. An increase of (S1) and (S2) in top-5 action recognition accuracy with the addition of audio demonstrates the importance of audio for egocentric action recognition. Fig 5 shows the confusion matrix with the utilisation of audio for the largest-15 verb classes (in S1). Studying the difference (Fig 5 right) clearly demonstrates an increase (blue) in confidence along the diagonal, and a decrease (red) in confusion elsewhere.
|S1 (Seen Kitchens)|Top-1 Verb|Top-1 Noun|Top-1 Action|Top-5 Verb|Top-5 Noun|Top-5 Action|Avg Class Precision Verb|Avg Class Precision Noun|Avg Class Precision Action|Avg Class Recall Verb|Avg Class Recall Noun|Avg Class Recall Action|
|Attention Clusters|40.39|19.37|11.09|78.13|41.73|24.36|21.17|09.65|02.50|14.89|11.50|03.41|
|(from leaderboard)|48.23|36.71|20.54|84.09|62.32|39.79|47.26|35.42|11.57|22.33|30.53|09.78|
|Ours (TSN w. Audio)|55.49|36.27|23.95|87.04|64.17|44.26|53.85|30.94|13.55|30.60|29.82|11.11|
|Ours (TBN, Single Model)|64.75|46.03|34.80|90.70|71.34|56.65|55.67|43.65|22.07|45.55|42.30|21.31|
|Ours (TBN, Ensemble)|66.10|47.89|36.66|91.28|72.80|58.62|60.74|44.90|24.02|46.82|43.89|22.92|
|S2 (Unseen Kitchens)|Top-1 Verb|Top-1 Noun|Top-1 Action|Top-5 Verb|Top-5 Noun|Top-5 Action|Avg Class Precision Verb|Avg Class Precision Noun|Avg Class Precision Action|Avg Class Recall Verb|Avg Class Recall Noun|Avg Class Recall Action|
|Attention Clusters|32.37|11.95|05.60|69.89|31.82|15.74|17.21|03.86|01.84|11.59|07.94|02.64|
|(from leaderboard)|39.40|22.70|10.89|74.29|45.72|25.26|22.54|15.33|06.21|13.06|17.52|06.49|
|Ours (TSN w. Audio)|46.61|22.50|13.05|78.19|48.59|29.13|28.92|15.48|06.47|21.58|16.61|07.55|
|Ours (TBN, Single Model)|52.69|27.86|19.06|79.93|53.78|36.54|31.44|21.48|12.00|28.21|23.53|12.69|
|Ours (TBN, Ensemble)|54.46|30.39|20.97|81.23|55.69|39.40|32.57|21.68|10.96|27.60|25.58|13.31|
Audio with irrelevant sounds: In the recorded videos for EPIC-Kitchens, background sounds irrelevant to the observed actions have been captured by the wearable sensor. These include music or TV playing in the background, ongoing washing machine, coffee machine or frying sounds while actions take place. To quantify the effect of these sounds, we annotated the audio in the test set, and report that of all action segments in S1, and of all action segments in S2 contain other audio sources. We refer to these as actions containing ‘irrelevant’ sounds, and independently report the results in Table 2. The table shows that the model’s accuracy increases consistently when audio is incorporated, even for the ‘irrelevant’ segments. Both models (All and RGB+Flow) show a drop in performance for ‘irrelevant’ S2 (comparing to ‘rest’), validating that irrelevant sounds are not the source of confusion, but that this set of action segments is more challenging even in the visual modalities. This demonstrates the robustness of our network to noisy and unconstrained audio sources.
Comparison of fusion strategies: As Fig 2 indicates, TBN performs mid-level fusion on the modalities within the binding window. Here we describe three alternative mid-level fusion strategies, and then compare their performances.
(i) Concatenation, where the feature maps of each modality are concatenated, and a fully-connected layer is used to model the cross-modal relations.
is a non-linear activation function. When used within TBWs, shared weights are learnt between modalities within a range of temporal shifts.
(ii) Context gating was used in , aiming to recalibrate the strength of the activations of different units with a self-gating mechanism:
(iii) Gating fusion was introduced in , where a gate neuron takes as input the features from all modalities to learn the importance of one modality w.r.t. all modalities.
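To illustrate the self-gating idea behind option (ii), the sketch below applies an element-wise sigmoid gate, computed from the fused feature vector itself, to that same vector. The form y = sigmoid(W x + b) * x is the commonly used formulation of context gating and is assumed here, since the equation itself is not present in this copy; the function names are ours.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def context_gating(x, weights, bias):
    """Recalibrate each unit of x with a gate in (0, 1) computed from all of x.

    weights must be square (len(x) rows of len(x) values) so that there is
    exactly one gate per input unit.
    """
    gates = [
        sigmoid(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]
    return [g * v for g, v in zip(gates, x)]


if __name__ == "__main__":
    # With zero weights and biases every gate is 0.5, so the input is halved.
    print(context_gating([2.0, -4.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]))
```

Unlike plain concatenation followed by a fully-connected layer, the gate can only rescale existing activations, which is one way to see why it adds parameters without necessarily adding discriminative power.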
In Table 3, we compare the various fusion strategies. We find that the simplest method, concatenation (Eq. 6) generally outperforms more complex fusion approaches. We believe this shows modality binding within a temporal binding window to be robust to the mid-level fusion method.
The effect of TBW width: Here, we investigate the effect of the TBW width in training and testing. We varied TBW width in training with , by training three TBN models for each respective window width. We noted little difference in performance. As changing in training is expensive and performance is subject to the particular optimisation run, we opt for a more conclusive test by focusing on varying in testing for a single model.
In testing, we vary . This corresponds, on average, to varying the width of TBW on the S1 test set between 60ms and 1200ms. We additionally run with synchrony . In each case we sample a single TBW , to solely assess the effect of the window size. We repeat this experiment for 100 runs and report mean and standard deviation in Fig. 7, where we compare results for verb and noun classes separately. The figure shows that best performance is achieved for , that is on average . TBWs of smaller width show a clear drop in performance, with synchrony comparable to . Note that the ‘Sync’ baseline provides only approximate synchronisation of modalities, as modalities have different sampling rates (RGB 60fps, flow 30fps, audio 24kHz). The model shows a degree of robustness for larger TBWs.
Note that in Fig. 7, we compare widths on a single temporal window in testing. When we temporally aggregate multiple TBWs, the effect of the TBW width is smoothed, and the model becomes robust to TBW widths.
Comparison with the state-of-the-art: We compare our work to the baseline results reported in  in Table 4 on all metrics. First, we show that late fusion with an additional audio stream outperforms the baseline on top-1 verb accuracy by 7% on S1 and also 7% on S2. Second, we show that our TBN single model improves these results even further (9%, 10% and 11% on top-1 verb, noun and action accuracy on S1, and 6%, 5% and 6% on S2 respectively). Finally, we report results of an ensemble of five TBNs, where each one is trained with a different TBW width. The ensemble shows additional improvement of up to 3% on top-1 metrics.
We compare TBN with Attention Clusters , a previous effort to utilise RGB, Flow, and Audio for action recognition, using pre-extracted features. We use the authors’ available implementation, and fine-tuned features (TSN, BN-Inception) from the global average pooling layer (1024D), to provide a fair comparison to TBN, and follow the implementation choices from . The method from  performs significantly worse than the baseline, as pre-extracted video features are used to learn attention weights.
At the time of submission, our TBN Ensemble results demonstrated an overall improvement over all state-of-the-art, published or anonymous, by 11% on top-1 verb for both S1 and S2. Our method was also ranked 2nd in the 2019 EPIC-Kitchens Action Recognition challenge. Details of the public leaderboard are provided in Appendix B.
We have shown that the TBN architecture is able to flexibly combine the RGB, Flow and Audio modalities to achieve an across the board performance improvement, compared to individual modalities. In particular, we have demonstrated how audio is complementary to appearance and motion for a number of classes; and the pre-eminence of appearance for noun (rather than verb) classes. The performance of TBN significantly exceeds TSN trained on the same data; and provides state-of-the-art results on the public EPIC-Kitchens leaderboard.
Further avenues for exploration include a model that learns to adjust TBWs over time, as well as implementing class-specific temporal binding windows.
Acknowledgements Research supported by EPSRC LOCATE (EP/N033779/1), GLANCE (EP/N013964/1) & Seebibyte (EP/M013774/1). EK is funded by EPSRC Doctoral Training Partnership, and AN by a Google PhD Fellowship.
Audio visual scene-aware dialog (AVSD) track for natural language generation in DSTC7. In DSTC7 at AAAI2019 Workshop, 2018.
Action recognition with coarse-to-fine deep feature integration and asynchronous fusion. AAAI, 2018.
AAAI Conference on Artificial Intelligence, 2018.
First person action recognition using deep learned descriptors. In CVPR, 2016.
We show selected qualitative results on a held-out validation set, taken from the publicly available training videos. We hold out 14 (untrimmed) videos from the training set for qualitative examples. The video can be watched at https://www.youtube.com/watch?v=VzoaKsDvv1o. For each example, we show the ground truth, and the predictions of individual modalities (RGB, Flow, Audio) compared with our TBN (Single Model).
In Fig. 8, we show results for Ours (TBN, Single Model) and Ours (TBN, Ensemble), as they appeared on the public leaderboard of the EPIC-Kitchens - Action recognition challenge on CodaLab at the time of submission (March 22nd 2019). As noted in the paper, the single model TBN outperforms all other submissions by a clear margin, on both test sets S1 and S2, and the results are further improved using an ensemble of TBNs trained with different TBW widths.
As the challenge concluded, our model (TBN_Ensemble) is ranked 2nd in the leaderboard. A snapshot of the leaderboard for the 2019 challenge is available at
A complete version of Fig 5 is available in Fig 9. It shows the confusion matrices without and with the utilisation of audio for the largest-15 verb and noun classes (in S1). The first confusion matrix shows TBN (RGB+Flow), and the second shows TBN (RGB+Flow+Audio). Studying the difference (Fig 5 right) clearly demonstrates an increase (blue) in confidence along the diagonal, and a decrease (red) in confusion elsewhere.
Table 5 shows a comparison of the performance of the top largest classes against the less represented classes, for individual modalities, and our proposed TBN. The classes are ranked by the number of examples in training, and the results are reported separately for the top- classes versus the rest which we refer to as tail classes. The effect of fusion is more evident on the tail classes, improvement on tail vs. improvement on top- for verbs, and improvement on tail vs. improvement on top- for nouns. This finding shows that fusion in TBN decreases the effect of the class-imbalance. Furthermore, it is important to note that audio outperforms RGB and flow on the tail verbs.
In Tables 6 and 7, we show per-class accuracies on S1, on selected verbs and nouns, respectively. We arrange the chosen set of verbs and nouns in three main categories: top: TBN outperforms the best individual modality, mid: TBN performs comparably with the best modality, and bottom: TBN performs worse than the best individual modality. We shade the rows reflecting these three groups in the order mentioned above.
A few conclusions could be made from these tables about the advantages of the proposed mid-level fusion:
Fusion can improve results when all modalities are individually performing well for both verb and noun classes (e.g. ‘open’, ‘fridge’), as well as when all modalities are under-performing (e.g. ‘scoop’, ‘salad’).
Fusion can though be difficult at times, particularly when two of the three modalities are uninformative (e.g. ‘divide’, ‘fish’).
All nouns for which audio is outperforming other modalities have distinct sounds (e.g. ‘switch’, ‘paper’).
Similarly, audio is least distinctive when the noun does not have a sound per se or its sound depends on the action (e.g. ‘chicken’, ‘salt’).
Python code of our TBN model, and pre-trained model on EPIC-Kitchens is available at