How Much Does Audio Matter to Recognize Egocentric Object Interactions?

by   Alejandro Cartas, et al.

Sounds are an important source of information about our daily interactions with objects. For instance, a significant number of people can discern the temperature of water being poured just by using the sense of hearing. However, only a few works have explored the use of audio for the classification of object interactions, either in conjunction with vision or as a single modality. In this preliminary work, we propose an audio model for egocentric action recognition and explore its usefulness on the subtasks of the problem (noun, verb, and action classification). Our model achieves a competitive result in terms of verb classification (34.26% accuracy) on the EPIC Kitchens benchmark with respect to vision-based state-of-the-art systems, using a comparatively lighter architecture.






1 Introduction

Human experience of the world is inherently multimodal [5, 3]. We employ different senses to perceive new information, both passively and actively, when we explore our environment and interact with it. In particular, object manipulations almost always have an associated sound (e.g., opening a tap), and we naturally learn and exploit these associations to recognize object interactions by relying on the available sensory information (audio, vision, or both). Recognizing object interactions from visual information has a long history of research in Computer Vision [19]. In contrast, audio has been comparatively little explored in this context, and most works have focused on auditory scene classification [10, 14, 15, 16]. Only more recently has audio been used in conjunction with visual information for scene classification [2] and object interaction recognition [1, 12].

All works previous to the introduction of Convolutional Neural Networks (CNNs) for audio classification [10, 14, 15, 16] shared a common pipeline consisting of first extracting a time-frequency representation from the audio signal, such as the mel-spectrogram [10], and then classifying it with methods like Random Forests or Support Vector Machines (SVMs) [16]. More recently, Rakotomamonjy and Gasso [15] proposed to work directly on the image of the spectrogram, instead of its coefficients, to extract audio features. More specifically, they used histograms of oriented gradients (HOGs) of the spectrogram image as features, on top of which they applied an SVM classifier. The idea of using the spectrogram image as input to a CNN to learn features in an end-to-end fashion was first proposed in [13].

Figure 1: Audio-based classification overview.
Figure 2: Statistics and histogram of the time duration (in seconds) of the video segments in the EPIC Kitchens dataset training split, with a bin size of half a second. Min: 0.5, Mean: 3.38, Median: 1.78, Std. Dev.: 5.04, Mode: 1.0, Max: 145.16.

Later on, Owens et al. [12] presented a model that takes as input a silent video of a person scratching or touching different materials and generates the corresponding synchronized sounds. In their model, the sequential visual information is processed by a CNN (AlexNet [9]) connected to a Long Short-Term Memory (LSTM) network [8], and the problem is treated as a regression of an audio cochleagram. Aytar et al. [2] proposed to perform auditory scene classification by using transfer learning from visual CNNs in an unsupervised fashion. In [6], a two-stream neural network learns semantically meaningful words and phrases at the spectral feature level from a natural input image and its spoken captions, without relying on speech recognition or text transcriptions.

Given the indisputable importance of sounds in human and machine perception, this work aims at understanding the role of audio in egocentric action recognition, and in particular at unveiling when audio and visual features provide complementary information in the specific domain of egocentric object interactions in a kitchen.

2 Audio-based Classification

Our model is a VGG-11 [17] neural network that takes an audio spectrogram as input. The spectrogram only considers the first four seconds of a short video segment. To determine this time interval, we calculated the time-duration statistics of the video segments in the filtered training split. We also calculated the duration histogram and its cumulative percentage, shown in black and red, respectively, in Fig. 2. As can be observed from the threshold line in yellow, setting the audio time window to 4 seconds allows the large majority of the video segments to be completely covered by one window. When a video segment lasts less than four seconds, zero padding is applied to its spectrogram.

S1 (seen kitchens)

                 Top-1 Accuracy        Top-5 Accuracy        Avg Class Precision   Avg Class Recall
                 Verb  Noun  Action    Verb  Noun  Action    Verb  Noun  Action    Verb  Noun  Action
Chance/Random    12.62 1.73  00.22    43.39 08.12 03.68    03.67 01.15 00.08    03.67 01.15 00.05
TSN (RGB)        45.68 36.80 19.86    85.56 64.19 41.89    61.64 34.32 09.96    23.81 31.62 08.81
TSN (FLOW)       42.75 17.40 09.02    79.52 39.43 21.92    21.42 13.75 02.33    15.58 09.51 02.06
TSN (FUSION)     48.23 36.71 20.54    84.09 62.32 39.79    47.26 35.42 10.46    22.33 30.53 08.83
Ours             34.26 08.60 03.28    75.53 25.54 11.49    12.04 02.36 00.55    11.54 04.56 00.89

S2 (unseen kitchens)

Chance/Random    10.71 01.89 00.22    38.98 09.31 03.81    03.56 01.08 00.08    03.56 01.08 00.05
TSN (RGB)        34.89 21.82 10.11    74.56 45.34 25.33    19.48 14.67 04.77    11.22 17.24 05.67
TSN (FLOW)       40.08 14.51 06.73    73.40 33.77 18.64    19.98 09.48 02.08    13.81 08.58 02.27
TSN (FUSION)     39.40 22.70 10.89    74.29 45.72 25.26    22.54 15.33 05.60    13.06 17.52 05.81
Ours             32.09 08.13 02.77    68.90 22.43 10.58    11.83 02.92 00.86    10.25 04.93 01.90

Table 1: Performance comparison with the EPIC Kitchens challenge baseline results on the seen (S1) and unseen (S2) test splits. Each metric reports verb, noun, and action scores. See text for more information.

We extracted the audio from all video segments at a sampling frequency that covers most of the band audible to the average person [7] (we used FFMPEG for the audio extraction). Since the audio from the videos has two channels, we merged them into one signal by computing their mean value. From this signal, we computed the short-time Fourier transform (STFT) [11], as we are interested in environmental noises rather than in human voices. The STFT used a Hamming window of length roughly equal to 30 ms with time overlap between consecutive windows; for convenience of the input size of our CNN, we used a sampling frame length of 661. The spectrogram is the squared magnitude of the STFT of the signal. Subsequently, we took the logarithm of the spectrogram in order to reduce the range of values. Finally, all the spectrograms were normalized, yielding a fixed-size input image for the network.
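The preprocessing steps above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' code: the 44.1 kHz sampling rate and the 1323-sample (~30 ms) window length are assumptions, and the 661-sample frame length from the text is interpreted here as the hop between windows.

```python
import numpy as np

def log_spectrogram(signal, sr=44100, clip_secs=4.0, win_len=1323, hop=661):
    """signal: mono waveform (stereo already averaged into one channel)."""
    window = np.hamming(win_len)
    # Frame the signal into overlapping windows and apply the Hamming taper.
    n_frames = 1 + max(0, (len(signal) - win_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + win_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T ** 2   # squared STFT magnitude
    spec = np.log(spec + 1e-10)                         # compress dynamic range
    # Zero-pad in time so every segment fills the fixed 4-second window.
    target = 1 + (int(clip_secs * sr) - win_len) // hop
    if spec.shape[1] < target:
        spec = np.pad(spec, ((0, 0), (0, target - spec.shape[1])))
    else:
        spec = spec[:, :target]
    return (spec - spec.mean()) / (spec.std() + 1e-8)   # normalize

# Example: a 2-second stereo clip averaged to mono, then padded to 4 s.
stereo = np.random.randn(2, 44100 * 2)
mono = stereo.mean(axis=0)
spec = log_spectrogram(mono)
```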

3 Experiments

The objective of our experiments was to determine the classification performance obtained by leveraging the audio modality in an egocentric action recognition task. We used the EPIC Kitchens dataset [4] in our experiments. Each video segment in the dataset shows a participant performing one specific cooking-related action. A labeled action consists of a verb plus a noun, for example, “cut potato” or “wash cup”. We used accuracy as the performance metric. Moreover, as a comparison with using only visual information, we show the results obtained on the official test split of the dataset.


The EPIC Kitchens dataset includes 432 videos egocentrically recorded by 32 participants in their own kitchens while cooking or preparing food. Each video is divided into segments in which the person performs one specific action (a verb plus a noun). The total numbers of verb and noun categories in the dataset are 125 and 352, respectively. This dataset is currently used in an egocentric action recognition challenge. Thus, we only used the labeled training split and filtered it according to [4], i.e., keeping only the verbs and nouns that have more than 100 instances in the split. This results in 271 videos comprising 22,018 segments, with 26 verb and 71 noun categories. The resulting distribution of action classes is highly unbalanced.
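The filtering step can be sketched as follows. The `segments` record schema (plain dicts with "verb" and "noun" keys) is hypothetical, not the dataset's actual annotation format:

```python
from collections import Counter

def filter_classes(segments, min_count=100):
    """Keep only segments whose verb and noun classes each have more
    than min_count instances in the split, as in the filtering of [4]."""
    verb_counts = Counter(s["verb"] for s in segments)
    noun_counts = Counter(s["noun"] for s in segments)
    return [s for s in segments
            if verb_counts[s["verb"]] > min_count
            and noun_counts[s["noun"]] > min_count]

# Toy example: "flip egg" occurs only 10 times, so it is dropped.
segments = ([{"verb": "cut", "noun": "potato"}] * 150
            + [{"verb": "flip", "noun": "egg"}] * 10)
kept = filter_classes(segments)
```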

Figure 3: Normalized verb confusion matrix.

                 Verb     Noun     Action
Chance/Random    13.11%   2.48%    0.75%
Ours (Top-1)     39.39%   13.41%   10.16%
Ours (Top-5)     81.99%   35.60%   26.36%

Table 2: Action recognition accuracy for all our experiments.
Figure 4: Normalized noun confusion matrix.

Dataset split

For the different experiments (verb, noun, and action), we divided the labeled data into training, validation, and test splits considering all the participants. In all cases, the data proportions for the validation and test splits were 10% and 15%, respectively. For the verb and noun experiments, the splits were obtained by randomly stratifying the data, since each category has more than 100 samples. In the case of the action experiment, the class imbalance was taken into account as follows. At least one sample of each category was put in the training split. If a category had at least two samples, one of them went to the test split. The remaining samples were randomly stratified across all splits.
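The rare-class handling for the action split can be sketched as below. This is a simplified per-class stratification under the stated rules, not the authors' exact procedure:

```python
import random
from collections import defaultdict

def action_split(samples, val_frac=0.10, test_frac=0.15, seed=0):
    """Split (sample_id, action_label) pairs handling rare classes:
    every class keeps one sample in train; classes with >= 2 samples
    place one in test; the remaining samples are stratified by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sid, label in samples:
        by_class[label].append(sid)
    train, val, test = [], [], []
    for label, ids in by_class.items():
        rng.shuffle(ids)
        train.append(ids[0])          # guarantee class coverage in train
        rest = ids[1:]
        if rest:
            test.append(rest.pop())   # one held-out sample per class
        # Stratify the remainder with the global split proportions.
        n_val = round(len(rest) * val_frac)
        n_test = round(len(rest) * test_frac)
        val.extend(rest[:n_val])
        test.extend(rest[n_val:n_val + n_test])
        train.extend(rest[n_val + n_test:])
    return train, val, test

train_ids, val_ids, test_ids = action_split([(i, i % 5) for i in range(100)])
```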


For all our experiments we used the stochastic gradient descent (SGD) optimization algorithm to train our network, with a momentum of 0.9 and a batch size of 6. Each experiment used its own learning rate and number of training epochs: 79 epochs for verb, 129 for noun, and 5 for action classification.

Results and discussion

The accuracy performance for all experiments is shown in Table 2, together with the random classification accuracy baseline computed from the dataset splits described above. We calculated the random classification accuracy over C categories as

    Acc_chance = sum_{c=1}^{C} p_c^train * p_c^test,

where p_c^train and p_c^test are the occurrence probabilities of class c in the train and test splits, respectively. As a means for comparison, we also present the test results on the seen (S1) and unseen (S2) splits of the EPIC Kitchens challenge in Table 1. These results were obtained using the networks trained for the verb and noun experiments.
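The chance-accuracy baseline can be computed directly from the split label lists; a minimal sketch:

```python
from collections import Counter

def chance_accuracy(train_labels, test_labels):
    """Expected accuracy of a classifier that predicts class c with its
    training-split frequency: sum over c of p_c^train * p_c^test."""
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    n_tr, n_te = len(train_labels), len(test_labels)
    return sum((train_counts[c] / n_tr) * (test_counts[c] / n_te)
               for c in train_counts)

# Two equally frequent classes give a chance accuracy of 0.5.
acc = chance_accuracy(["a", "b"] * 50, ["a", "b"] * 10)  # 0.5
```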

The overall results indicate a good performance using audio alone for verb classification. Additionally, the models fail to recognize categories that do not produce sound, such as flip (verb) and heat (noun), as seen in the confusion matrices in Fig. 3 and 4. In the case of verbs, the model also fails on conceptually close verbs like the pairs turn-on/open and turn-off/close. In the case of noun classification, the model incorrectly predicts objects made of similar materials; for example, the categories can and pan are both metallic.

We consider that an object may produce different sounds depending on how it is manipulated, and this may help to better discriminate verbs performed on objects that are visually ambiguous from an egocentric perspective. For instance, knives and peelers are visually similar objects that could lead to misclassification between the verbs cut and peel applied to nouns like carrot, but their sounds were mostly correctly classified, as seen in the confusion matrices in Fig. 3 and 4.

4 Conclusion

We presented an audio-based model for egocentric action classification trained on the EPIC Kitchens dataset. We analyzed the results on the splits we made from the training set and compared them with the visual test baseline. The obtained results show that audio alone achieves good performance on verb classification (34.26% accuracy). This suggests that audio could complement visual sources on the same task in a multimodal manner. Further work will focus directly on this research line.


  • [1] R. Arandjelović and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision, 2017.
  • [2] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, 2016.
  • [3] Ruth Campbell. The processing of audio-visual speech: empirical and neural bases. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1493):1001–1010, 2007.
  • [4] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.
  • [5] Francesca Frassinetti, Nadia Bolognini, and Elisabetta Làdavas. Enhancement of visual perception by crossmodal visuo-auditory interaction. Experimental brain research, 147(3):332–343, 2002.
  • [6] David F. Harwath, Antonio Torralba, and James R. Glass. Unsupervised learning of spoken language with visual context. In NIPS, 2016.
  • [7] Henry Heffner and Rickye Heffner. Hearing ranges of laboratory animals. Journal of the American Association for Laboratory Animal Science : JAALAS, 46:20–2, 02 2007.
  • [8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [10] David Li, Jason Tam, and Derek Toub. Auditory scene classification using machine learning techniques. AASP Challenge, 2013.
  • [11] Brian McFee, Colin A. Raffel, Dawen Liang, Daniel Patrick Whittlesey Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, 2015.
  • [12] Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. Visually indicated sounds. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2405–2413, 2016.
  • [13] K. J. Piczak. Environmental sound classification with convolutional neural networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, Sep. 2015.
  • [14] Karol J. Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, MM ’15, pages 1015–1018, New York, NY, USA, 2015. ACM.
  • [15] A. Rakotomamonjy and G. Gasso. Histogram of gradients of time–frequency representations for audio scene classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1):142–153, Jan 2015.
  • [16] Gerard Roma, Waldo Nogueira, and Perfecto Herrera. Recurrence quantification analysis features for auditory scene classification. 2013.
  • [17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [18] Carlos Velasco, Russ Jones, Scott King, and Charles Spence. The sound of temperature: What information do pouring sounds convey concerning the temperature of a beverage. Journal of Sensory Studies, 28(5):335–345, 2013.
  • [19] Hong-Bo Zhang, Yi-Xiang Zhang, Bineng Zhong, Qing Lei, Lijie Yang, Ji-Xiang Du, and Duan-Sheng Chen. A comprehensive survey of vision-based human action recognition methods. Sensors, 19(5), 2019.