Audio-Visual Model Distillation Using Acoustic Images

04/16/2019 ∙ by Andrés F. Pérez, et al. ∙ 6

In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and a novel audio data modality, namely acoustic images. Previous models learn audio representations from raw signals or spectral data acquired by a single microphone, with remarkable results in classification and retrieval. However, such representations are not robust to variable environmental sound conditions. We tackle this drawback by exploiting a new multimodal labeled action recognition dataset acquired by a hybrid audio-visual sensor that provides RGB video, raw audio signals, and spatialized acoustic data, also known as acoustic images, where the visual and acoustic images are aligned in space and synchronized in time. Using this richer information, we train audio deep learning models in a teacher-student fashion. In particular, we distill knowledge into audio networks from both visual and acoustic image teachers. Our experiments suggest that the learned representations are more powerful and have better generalization capabilities than the features learned from models trained using just visual or single-microphone audio data.




1 Introduction

Humans experience the world through a number of simultaneous sensory observation streams. The co-occurrence of these streams provides a useful learning signal to understand the environment surrounding us [12]. There is in fact evidence that audio-visual mirror neurons play a central role in the recognition of actions given their temporal synchronization [4]. Furthermore, it was found that many neurons with receptive fields spatially aligned across modalities show a super-additive response to coincident and co-localized multimodal stimulations [45].

In this paper, motivated by these findings, we investigate whether and how visual and acoustic data synchronized in time and aligned in space can be exploited for scene understanding. We take advantage of a recent audio-visual sensor, called DualCam, composed of an optical camera and a 2D planar array of microphones (see Figure 3), able to provide spatially localized acoustic data aligned with the corresponding optical image (see Figure 1, right) [48]. Specifically, by combining the raw signals acquired by 128 microphones (by beamforming [44]), this sensor is able to output an acoustic image where each pixel represents the imprint of the sound coming from the corresponding pixel location in the optical image. Using this sensor, we generate a new multimodal dataset depicting different subjects performing several actions in multiple scenarios. By exploiting spatialized audio information coupled to the related visual data and designing suitable multimodal deep learning models, we aim at generating more discriminative and robust features, likely resulting in a better description of the scene content for robust audio classification. Figure 1 shows the multispectral acoustic image used as input data, which has 512 frequency bins, and an example visualization of an acoustic image overlaid on an optical image.

Figure 1: Left: multispectral acoustic image volume associated with the audio content of the sensed scene. It has two spatial dimensions (aligned with the visual image space) and a frequency axis of 512 bins that cover the sensor’s audible range. Each image in the volume represents the spatial audio information associated with each frequency bin. Right: visualization (as heat color map) of an acoustic image formed by summing the energy of every frequency bin between and for each spatial location, overlaid on the corresponding video frame. The spatial location of the audio signal with the highest intensity is identified in red.
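The visualization described in the caption can be sketched in a few lines: collapse the frequency axis of the acoustic volume into a per-pixel energy map, then locate the loudest spatial position. This is a minimal illustration (not the authors' code), with a toy volume whose spatial size is hypothetical:

```python
import numpy as np

def energy_map(acoustic_volume):
    """Collapse a multispectral acoustic volume (H, W, F) into a single
    energy map by summing the energy of every frequency bin per pixel."""
    return np.sum(np.abs(acoustic_volume) ** 2, axis=-1)

def loudest_pixel(acoustic_volume):
    """Spatial location (row, col) of the highest acoustic intensity,
    i.e. the pixel highlighted in red in the overlay."""
    e = energy_map(acoustic_volume)
    r, c = np.unravel_index(np.argmax(e), e.shape)
    return (int(r), int(c))

# Toy volume: hypothetical 36x48 spatial grid, 512 frequency bins,
# with one strong broadband source at (10, 20).
rng = np.random.default_rng(0)
vol = rng.standard_normal((36, 48, 512)) * 0.01
vol[10, 20, :] += 1.0
print(loudest_pixel(vol))  # -> (10, 20)
```

The energy map would then be rendered as a heat color map and alpha-blended over the corresponding video frame.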

The idea of leveraging the co-occurrence of visual and audio events as a supervisory signal is not new. Former approaches in the pre-deep-learning era combined visual and auditory signals in rather simplistic ways. For instance, in [46] a neural network was trained to predict the auditory signal given the visual input. A particularly relevant earlier work introduced a self-supervised learning algorithm for jointly training audio and visual networks by minimizing codebook disagreement. Another interesting work is [23], which presented an algorithm based on canonical correlation analysis (CCA) to detect pixels associated with the sound, while filtering out other dynamic (but silent) pixels.

Several recent works address audio-related tasks such as natural sound recognition [26], speech separation and enhancement [1, 10], audio event classification, or sound source localization [35, 42], either by directly modeling raw audio signals with 1D convolutions [5, 31, 37] or, most popularly, by modeling intermediate sound representations such as spectrograms or cochleograms [2, 3, 33, 32, 38, 39, 40, 47]. Nevertheless, none of these past works exploited spatially localized acoustic data to assess the potential of such a richer information source.

In our work, we claim that it is possible to train audio deep learning models to tackle an action recognition problem more robustly across different scenarios by utilizing a teacher-student framework able to distill knowledge [11, 28] from state-of-the-art vision network models and from a novel architecture that operates on the spatialized acoustic data. Similarly to [30], our intuition is to learn better features for a given modality assuming the availability of other complementary modalities at training time. We leverage video and multispectral acoustic image sequences aligned in space and time as side information at training time, and predict actions given only a raw audio signal acquired by a single microphone at test time, in a cross-scenario setting where the environmental noise conditions are significantly different. Current methods, even the best deep learning models, lead to very low classification accuracies [14, 29] in such conditions.

Hence, in essence, in this work we try to answer the following question: does spatialized data allow learning more discriminative features for single-microphone audio classification? In this respect, our main contributions can be summarized as follows.

  1. We propose a thorough study to assess whether visual and acoustic data aligned in space and synchronized in time bring advantage for single-microphone audio classification.

  2. We introduce a new multimodal dataset consisting of 14 action classes, in which acoustic and visual data are spatially aligned. This type of multi-sensory data has no counterpart in the literature and may lead to further studies by the scientific community.

  3. We develop a deep teacher-student model to deal with such new data, showing that it is indeed possible to extract semantically richer representations for improving audio classification from a single microphone. In particular, we distill knowledge learned from spatialized audio-visual modalities into a single-microphone model.

It is worth noting that we are the first to propose an algorithm in which the transfer of knowledge involves teacher models covering two different modalities (2D audio and 2D visual data) while the student model is devised for yet another modality (1D audio signals), whereas typically the student deals with the same task and modality as one of the teacher models.

We validate our approach 1) on the proposed action dataset, and 2) by transferring the learned representations to a standard sound classification benchmark dataset, demonstrating remarkable capabilities, especially in cross-dataset conditions, and the usefulness of distillation for cross-scenario learning.

The remainder of this paper is organized as follows. We first discuss the related work in Section 2, mainly focusing on audio-visual models and benchmark datasets. In Section 3, we describe our new acquired multimodal action dataset, and in Section 4, we describe acoustic image pre-processing and we propose the network architecture to deal with acoustic images. In Section 5, we present our distillation-based approach to deal with multispectral acoustic data, and in Section 6, we extensively validate our proposed framework by devising a set of experiments in order to assess the soundness of the learned representations. Finally, we draw conclusions in Section 7.

2 Related Work

We briefly review related work in the areas of multimodal learning, video and sound self-supervision, and transfer learning. We also review already existing audio and audio-visual datasets.

Multimodal learning. Multimodal learning concerns relating information from multiple data modalities; such data provide complementary semantic information due to the correlations between them [30]. We consider the cross-modality learning setting, in which data from multiple modalities is available only during training, while data from a single modality is provided at test time. In [6, 7] the authors learn shared representations from aligned data and use them for cross-modal retrieval across different data modalities. [6], for instance, considers three major natural modalities: vision, sound, and language, while [7] considers five weakly aligned modalities: natural images, sketches, clip art, spatial text, and descriptions. Other works such as [11, 20] utilize RGB video and depth information to learn feature representations through modality hallucination. In our case, instead, we consider RGB video, raw audio, and acoustic images at training time, and only raw audio at test time.

Video and sound self-supervision. There has been increased interest in using deep learning models for multimodal fusion of auditory and visual signals to improve the performance of visual models or solve various speech-related problems, such as speech separation and enhancement.

First approaches trained single networks on one modality, using the other one to derive some sort of supervisory signal [5, 17, 33, 32, 34]. For example, [5, 17] train an audio network to correlate with visual outputs, using pre-trained visual networks as teachers. Others, such as [32, 33], train a visual network to generate sounds by solving a regression problem consisting of mapping a sequence of video frames to a sequence of audio features. In [34], instead, they learn visual models using ambient sounds as scene labels.

More recent works [2, 3, 9, 31, 38] train both visual and audio networks, aiming at learning multimodal representations useful for many applications, such as cross-modal retrieval, speech separation, sound source localization, action recognition, and on/off-screen audio source separation. For instance, [2, 3] learn aligned audio-visual representations using an audio-visual correspondence task. In [31] they train an early-fusion multisensory network to predict whether video frames and audio are temporally aligned. In [38] they train a two-stream network utilizing an attention mechanism guided by sound information to localize the sound source.

The key factor in all these works is that they exploit the natural synchronization between auditory and visual signals by training in a self-supervised manner. Although we still address our problem in a semi-supervised manner, we notice that the natural spatial alignment and time synchronization of the data produced by the DualCam sensor opens the door to also training models through self-supervision.

Transfer learning. Our work is strongly related to transfer learning, which deals with sharing information from one task to another. In particular, we transfer knowledge between networks operating on different data modalities (see Section 5). We perform the transfer with the aid of the generalized distillation framework, which proposes to use the teacher-student approach from distillation theory to extract knowledge from a privileged information source [28], also called a teacher. In our case, the privileged information leveraged at training time is represented by the additional modalities, i.e. video and acoustic images. A rather simple transfer mechanism is that of [5], which proposes a teacher-student self-supervised training procedure based on the Kullback-Leibler divergence to transfer knowledge from a vision model into the sound modality using unlabeled video as a bridge. This mechanism resembles the generalized distillation framework; however, they rely only on the teacher's soft labels, which are in general less reliable than hard labels. An interesting work is [20], which introduces a novel technique for incorporating additional information, in the form of depth images, at training time to improve test-time RGB-only detection models. We draw inspiration from [11], which addresses action recognition by distilling knowledge from a depth network into a vision network. They accomplish this by training a hallucination network [20] that learns to distill depth features. It is worth noticing that although [11] works with different data modalities, it is the closest to ours, since they transfer knowledge with the aid of the generalized distillation framework.

Audio-visual datasets. Due to recent interest in audio-visual and multimodal learning, several audio and audio-visual datasets have emerged. We summarize some of the most prominent ones, aiming at locating our multimodal action dataset within the currently existing corpus. AudioSet [13] consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. It contains a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. Flickr-SoundNet [5] is a large unlabelled dataset of completely unconstrained videos from Flickr, compiled by searching for popular tags, but no tags or any sort of additional information apart from the videos themselves are used. It contains over 2 million videos. Kinetics-Sounds [2] comprises a subset of the Kinetics dataset [22], which contains YouTube videos manually annotated for human actions and cropped to 10 seconds around the action. The subset contains 19k video clips formed by filtering the Kinetics dataset for 34 human action classes, which have been chosen to be potentially manifested visually and aurally. Environmental Sound Classification (ESC-50) [36] is a labeled collection of 2,000 environmental audio recordings manually extracted from Freesound. It consists of 5-second-long recordings organized into 50 semantic classes loosely arranged into five major categories: animals, natural soundscapes & water sounds, human non-speech sounds, interior/domestic sounds, and exterior/urban noise. Detection and Classification of Acoustic Scenes and Events (DCASE) [29] is a dataset consisting of recordings from various acoustic scenes. It was recorded in six large European cities, in different locations for each scene class. For each recording location there are 5 to 6 minutes of audio split into segments of 10 seconds.
Similarly to these works, we propose a multimodal dataset for action recognition which captures data in more realistic conditions, in order to allow the learning of more discriminative features that better cope with tasks such as sound classification.

3 Audio-Visually Indicated Action Dataset

We introduce a new multimodal dataset comprised of visual data, as RGB image sequences, and acoustic data, as raw audio signals acquired from 128 microphones. The latter signals, suitably combined by a beamforming algorithm, compose a multispectral acoustic image volume, which is aligned in space and time with the optical images (see Figure 1). The following 14 actions were chosen:

  1. Clapping

  2. Snapping fingers

  3. Speaking

  4. Whistling

  5. Playing kendama

  6. Clicking

  7. Typing

  8. Knocking

  9. Hammering

  10. Peanut breaking

  11. Paper ripping

  12. Plastic crumpling

  13. Paper shaking

  14. Stick dropping

For the acquisition, 9 people participated, performing the aforementioned actions in three different scenarios with increasing and varying noise conditions, namely an anechoic room, an indoor open space area, and an outdoor terrace. We name them scenarios 1, 2, and 3, respectively. In our dataset, the same action is performed by different subjects in distinct places, allowing us to show the equivariance properties of the multispectral acoustic images across subjects, scenarios, and positions in the scene, which are exploited when learning audio features from an acoustic teacher model. In the end, the dataset consists of 378 audio-visual video sequences (27 per action), each between 30 and 60 seconds long, depicting different people individually performing a set of actions producing a characteristic sound in each scenario. Figure 2 shows representative samples of our dataset for the 3 considered scenarios.

Figure 2: Three examples of Audio-Visual Indicated Actions dataset represented as video frame, acoustic image visualization overlaid on the frame, and raw waveform (from a single microphone). (a) Speaking in anechoic room. (b) Hammering in the indoor open space area. (c) Playing Kendama in the terrace.

We acquired the dataset using the DualCam acoustic-optical camera described in [48]. The sensor captures both audio and video data using a planar array of 128 low-cost digital MEMS microphones located according to an optimized aperiodic layout, and a video camera placed at the device center as depicted in Figure 3.

Figure 3: DualCam acoustic-optical camera.

The device is capable of acquiring audio data in the range and audio-video sequences at a frame rate of 12 frames per second (fps). The camera has a maximum field of view of 90° in elevation and 360° in azimuth. In our acquisition setup the camera was always static, looking at the scene, while the subjects moved around but always within its field of view. Finally, the acoustic image resolution provided by the sensor is 5° at . During the acquisitions, the subjects were always within 2 meters of the device, so the provided acoustic resolution is adequate.

After collecting the dataset, audio and video data had to be synchronized since they were acquired in an interleaved way at different frame rates.

The data provided by the sensor consists of RGB video frames of pixels, raw audio data from 128 microphones acquired at a frequency of 12 kHz, and multispectral acoustic images obtained from the raw audio signals of all the microphones using beamforming, which summarize the per-direction audio information in the frequency domain. Each acoustic pixel thus corresponds to 13.3 visual pixels, since the acoustic resolution is lower than the optical one. Among the raw audio waveforms, we use the one from just one microphone for testing single-microphone audio networks.

4 Learning with Acoustic Images

In this section, we describe the acoustic image representation, its pre-processing, and the network architecture we propose for modelling this novel type of data.

Acoustic Images Pre-processing. Multispectral acoustic images are generated with the frequency-domain implementation of the filter-and-sum beamforming algorithm [44], which produces a volume of size , with 512 channels corresponding to the frequency bins representing the frequency information. Full details of the algorithm can be found in [48].
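The core idea of beamforming (here a plain delay-and-sum variant rather than the paper's filter-and-sum implementation) can be sketched in the frequency domain: per-microphone spectra are phase-aligned toward a candidate direction and averaged. The array geometry and frequency grid below are purely illustrative:

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s

def delay_and_sum(mic_ffts, mic_xy, freqs, direction):
    """Narrowband delay-and-sum beamformer in the frequency domain.

    mic_ffts : (M, F) complex spectra, one row per microphone
    mic_xy   : (M, 2) planar microphone coordinates in metres
    freqs    : (F,) frequency of each bin in Hz
    direction: (2,) unit steering vector in the array plane
    Returns the (F,) beamformed spectrum for that direction.
    """
    delays = mic_xy @ direction / C                          # per-mic delays, s
    steering = np.exp(2j * np.pi * np.outer(delays, freqs))  # phase alignment
    return (mic_ffts * steering).mean(axis=0)

# Toy check: a plane wave from d_true sums coherently when the beam is
# steered at it, and is attenuated when steered elsewhere.
rng = np.random.default_rng(1)
mic_xy = rng.uniform(-0.2, 0.2, size=(128, 2))  # hypothetical 128-mic layout
freqs = np.linspace(500.0, 6000.0, 64)
d_true = np.array([1.0, 0.0])
spectra = np.exp(-2j * np.pi * np.outer(mic_xy @ d_true / C, freqs))
on_beam = np.abs(delay_and_sum(spectra, mic_xy, freqs, d_true)).mean()
off_beam = np.abs(delay_and_sum(spectra, mic_xy, freqs,
                                np.array([0.0, 1.0]))).mean()
```

Scanning a grid of steering directions produces one spectrum per direction, i.e. one acoustic pixel per spatial location, which is how the acoustic image volume is formed.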

Handling input acoustic images with 512 channels is computationally expensive, and the majority of the information in our dataset is typically contained in the low frequencies. Consequently, we decided to compress the acoustic images using Mel-Frequency Cepstral Coefficients (MFCC), which account for the characteristics of human auditory perception [41]. We therefore compute 12 MFCCs, going from -D volumes to -D volumes, retaining the most important information while considerably reducing the computational complexity and memory footprint.
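A minimal per-pixel version of this compression can be sketched as follows: apply a triangular mel filterbank to the 512 frequency bins of each acoustic pixel, take the log, and keep the first 12 DCT-II coefficients. The mel band count and frequency range below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def mel_filterbank(n_filters, n_bins, f_max, f_min=0.0):
    """Triangular mel-spaced filters mapping n_bins linear-frequency
    bins to n_filters mel bands."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    edges = mel_to_hz(mels) / f_max * (n_bins - 1)   # filter edges in bins
    fb = np.zeros((n_filters, n_bins))
    bins = np.arange(n_bins)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i] = np.clip(np.minimum((bins - lo) / (mid - lo + 1e-9),
                                   (hi - bins) / (hi - mid + 1e-9)), 0, None)
    return fb

def compress_acoustic_image(volume, n_mfcc=12, n_mels=32, f_max=6000.0):
    """Compress an (H, W, 512) acoustic volume to (H, W, n_mfcc): log mel
    energies per pixel followed by a DCT-II over the mel axis."""
    H, W, F = volume.shape
    fb = mel_filterbank(n_mels, F, f_max)
    logmel = np.log(volume @ fb.T + 1e-8)                 # (H, W, n_mels)
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))  # DCT-II basis
    return logmel @ dct.T                                 # (H, W, n_mfcc)

# Toy non-negative energy volume with a hypothetical 36x48 spatial grid.
vol = np.abs(np.random.default_rng(0).standard_normal((36, 48, 512))) + 0.1
compressed = compress_acoustic_image(vol)
print(compressed.shape)  # -> (36, 48, 12)
```

In practice one would reuse a library MFCC implementation; this sketch only illustrates how the 512-channel volume shrinks to 12 channels per pixel.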

DualCamNet Architecture. Acoustic images provide a small temporal support which is generally not enough for discriminating information over time intervals of several seconds. For this reason, we feed to our network a set of 12 consecutive acoustic images corresponding to 1 second of audio data. We deem that 1 sec of acoustic images is a reasonable tradeoff between sound information content and processing cost.

In order to train a model able to discriminate information from acoustic images, we explicitly model both the spatial and the temporal relationships among them. To this end, we propose the architecture structure shown in Figure 4 which utilizes 3D convolutions as commonly done in visual action recognition [43], where the spatial and temporal convolutions are decoupled.

We follow the LeNet [25] design style, with convolutional filters, and max-pooling layers with stride 1 and zero-padding to keep the spatial resolution. The network includes 3 blocks of convolutional layers plus a block of 3 fully convolutional layers which produces the output prediction.

The first block consists of a single 1D convolutional layer over time followed by a ReLU nonlinearity. The aim of this layer is to model the temporal relationship of consecutive acoustic images by aggregating them. In particular, we apply a filter of size 7 with stride 1 and zero-padding to keep the temporal resolution. We experimented with several filter sizes, finding 7 to be the best one.

The second and third blocks model the spatial equivariance of the acoustic images and consist of a 2D convolutional layer followed by max-pooling. We go from the 12 channels of the input to 32 channels and then double it to 64. Each convolutional layer is followed by batch normalization [21] and a ReLU nonlinearity.

The final block comprises 3 fully convolutional layers with ReLU nonlinearities in between. It converts the input feature map into a 14-D classification vector as output, namely the predicted class probabilities, using intermediate feature sizes of 1024-D and 1000-D.

This model will be used as teacher network in our validation experiments.
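The blocks described above can be sketched in PyTorch. This is a rough reconstruction from the text, not the authors' code: the channel count after the temporal block, the spatial kernel/pooling sizes, and the global pooling before the fully convolutional head are assumptions.

```python
import torch
import torch.nn as nn

class DualCamNetSketch(nn.Module):
    """Sketch of DualCamNet as described in the text: a temporal 1D conv
    (kernel 7) over 12 consecutive MFCC acoustic images, two spatial 2D
    conv blocks (12 -> 32 -> 64) with batch norm, and a fully
    convolutional head (1024 -> 1000 -> 14)."""

    def __init__(self, n_classes=14):
        super().__init__()
        # Block 1: 1D convolution over time, padding keeps the 12 frames.
        self.temporal = nn.Sequential(
            nn.Conv3d(12, 12, kernel_size=(7, 1, 1), padding=(3, 0, 0)),
            nn.ReLU(inplace=True))
        # Blocks 2-3: spatial 2D convolutions with stride-1 max-pooling.
        def spatial_block(cin, cout):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=(1, 5, 5), padding=(0, 2, 2)),
                nn.BatchNorm3d(cout), nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 3, 3), stride=1,
                             padding=(0, 1, 1)))
        self.spatial = nn.Sequential(spatial_block(12, 32),
                                     spatial_block(32, 64))
        # Final block: 1x1x1 "fully convolutional" classifier head.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(64, 1024, 1), nn.ReLU(inplace=True),
            nn.Conv3d(1024, 1000, 1), nn.ReLU(inplace=True),
            nn.Conv3d(1000, n_classes, 1))

    def forward(self, x):  # x: (batch, 12 MFCC channels, 12 frames, H, W)
        return self.head(self.spatial(self.temporal(x))).flatten(1)

model = DualCamNetSketch().eval()
x = torch.randn(2, 12, 12, 36, 48)  # toy batch; spatial size is assumed
with torch.no_grad():
    out = model(x)
print(out.shape)  # -> torch.Size([2, 14])
```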

(a) DualCamNet
(b) HearNet
(c) OurSoundNet
Figure 4: Our proposed networks. (a) DualCamNet architecture, used as teacher model. (b) HearNet architecture, used as student model. (c) OurSoundNet architecture, used as student model.

5 Model Distillation

In this section, we describe the utilized network architectures and the knowledge transfer procedure.

5.1 Architectures

Similarly to [11], we utilize data from multiple modalities during training, and only data from a single modality at test time. In contrast to their work, in which representations are learnt from depth and RGB videos while relying on RGB data only at test time, we leverage either RGB video or multispectral acoustic images as side information in training, and we test only on audio data from a single microphone.

We want to emphasize here that, to the best of our knowledge, this is the first time that model distillation is performed from modalities different from the one utilized at test time. Specifically, we train on 2-dimensional spatialized audio and video data to improve the accuracy of a model taking only mono-dimensional audio signals as input. As a further original aspect, [11] trains one ResNet-50 [18] network per stream, while we use a different network architecture for each stream of our model.

Teacher Networks. For the visual stream, we experimented with two models: ResNet-50 [18] and its variation including 3D temporal convolutions introduced in [11], here called Temporal ResNet-50. We chose ResNet-50 over other ImageNet-trained CNNs as it provides a good compromise between network size and accuracy. Temporal ResNet-50, on the other hand, stands as a strong action recognition model dealing with action dynamics with the aid of temporal connections between residual units, and constitutes a powerful baseline to compare with. DualCamNet, described earlier in Section 4, will be used as a teacher model as well in the following.

Student Networks. Regarding the raw audio waveform stream, we experimented with two models that capture different characteristics of audio data. The first one is SoundNet [5], which operates on time-domain signals. We preferred the 5-layer version over the 8-layer one, as our dataset is not big enough to allow SoundNet to grasp the underlying data patterns. We used the exact same architecture described in [6], adding 3 fully convolutional layers at the bottom of the network with 1024, 1000, and 14 filters, respectively. To avoid confusion, we named our version OurSoundNet.

The second model is a network based on the sound sub-network presented in [6], called from here on HearNet. Its architecture is shown in Figure 4. This network operates on amplitude spectrograms obtained from an audio waveform of 5 seconds, upsampled to . The spectrogram is produced by computing the Short-Time Fourier Transform (STFT) with a window length of with half-window overlap. This produces 500 windows with 257 frequency bands. The resulting spectrogram is interpreted as a 257-dimensional signal over 500 time steps.

HearNet processes the spectrogram with 4 1D convolutions using kernel sizes 11, 5, 3 and 128, 256, 256 filters, respectively, with stride 1. The last convolutional layers are fully convolutional and use 1024, 1024, 1000, and 14 filters to obtain the class predictions. We applied zero-padding in all layers except conv4 in order to keep the spatial resolution. The chosen activation function is ReLU. After each of the first 3 convolutional layers, we downsample with one-dimensional max-pooling by a factor of 5.

5.2 Training procedure

Following the generalized distillation framework [28], we first learn a teacher function by solving a classification problem and, second, compute the teacher's soft labels. As a third step, we distill the teacher's knowledge into the student by using both the hard and the soft labels. The knowledge transfer procedure is graphically illustrated in Figure 5.

Unlike [11], which combines Hinton's distillation loss [19] with Hoffman's hallucination loss [20], we consider only the former, as we are transferring knowledge between completely different network streams. More formally, we distill the teacher's learned representation f_t into the student f_s as follows:

f_s = argmin_{f ∈ F_s} (1/n) Σ_{i=1}^{n} [ (1 − λ) ℓ(y_i, σ(f(x_i))) + λ ℓ(s_i, σ(f(x_i))) ],  with s_i = σ(f_t(x_i) / T),

where s_i are the soft labels derived from the teacher on the training data, F_t and F_s are the classes of functions described by the teacher and student models [28], respectively, σ is the softmax operator, and y_i are the ground-truth hard labels. The imitation parameter λ ∈ [0, 1] balances the weight of the soft labels s_i with respect to the true hard labels y_i. The temperature parameter T smooths the probability vector predicted by the teacher network f_t.
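The combined objective can be sketched in a few lines of numpy. This is one common formulation of the distillation loss, not the authors' exact implementation (e.g. whether the student's own logits are also temperature-smoothed varies between formulations):

```python
import numpy as np

def softmax(z, T=1.0):
    """Row-wise softmax with optional temperature smoothing."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      lam=0.5, T=2.0):
    """Generalized-distillation objective: a convex combination of the
    cross-entropy with the hard labels and the cross-entropy with the
    teacher's temperature-smoothed soft labels."""
    p = softmax(student_logits)           # student predictions
    soft = softmax(teacher_logits, T=T)   # teacher soft labels s_i
    n = student_logits.shape[0]
    ce_hard = -np.log(p[np.arange(n), hard_labels] + 1e-12).mean()
    ce_soft = -(soft * np.log(p + 1e-12)).sum(axis=1).mean()
    return (1 - lam) * ce_hard + lam * ce_soft

# Toy batch of 8 examples over the 14 action classes.
rng = np.random.default_rng(3)
student = rng.standard_normal((8, 14))
teacher = rng.standard_normal((8, 14))
labels = rng.integers(0, 14, size=8)
loss = distillation_loss(student, teacher, labels, lam=0.5, T=2.0)
```

Setting lam=0 recovers plain supervised training on the hard labels, while lam=1 trains the student on the teacher's soft labels only.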

Figure 5: Teacher-student training procedure

6 Experimental Results

Our goal in this paper is to learn feature representations for raw audio data by transferring knowledge across networks operating on different data modalities. To evaluate how well our method addresses this problem we perform two sets of experiments with the twofold objective of 1) showing the improvement brought by distilling knowledge from other networks – audio and visual – in particular from a network trained on acoustic images, and 2) assessing the quality of the learned representations on a standard sound classification benchmark.

6.1 Audio features

In this first set of experiments, we evaluate the performance of the teacher and student networks on the task of action recognition on our dataset. We train these networks following the procedure described in Section 5, using action labels as ground truth. We trained for 100 epochs (the number of iterations varies with the size of the training set) with batches of 32 elements, using the Adam optimizer [24] with a learning rate of and, in some cases, of (details are in the Supplementary Material). In order to measure the generalization capabilities of the learned representations, we evaluate the accuracy of our trained models in a cross-scenario setting. For each scenario, and for the complete dataset, 80% of the data is used for training and the rest is divided equally between the test and validation sets. When we train a model on one scenario, we test it on that scenario's test set and on the whole data of the other two scenarios. When training on all scenarios, we test on the union of all test sets.

Teacher Networks. First, we train our DualCamNet model and the two proposed visual networks, ResNet-50 and Temporal ResNet-50, as they constitute our baselines. Table 1 shows their performance. We observe that our DualCamNet convincingly outperforms the visual networks in all combinations of scenarios. This implies that most of the actions in our dataset are better distinguishable aurally than visually. One possible explanation is that in the majority of cases the "object" involved in the action execution, e.g., mouth, mouse, or hammer, is not easily visible but has a characteristic sound signature.

A comparison of the two visual networks reveals that they achieve similar results throughout all configurations, indicating that motion is not a key factor to model the actions performed in our dataset. Consequently, we choose ResNet-50 over Temporal ResNet-50 for the rest of the experiments since the former one has a simpler structure.

Additionally, we have designed a hybrid network which combines the output of the DualCamNet and ResNet-50, to check whether modality fusion brings better performance. We do so by concatenating the 1024 feature volumes of the two networks and processing them with two further fully convolutional layers of 1000 and 14 filters, respectively. This network achieves a 7.1% improvement in accuracy with respect to the top performer when training over all scenarios. It is important to note that it also consistently improves the testing accuracy in all cross-scenario configurations (see Table 1, AV column). These findings indicate some benefits brought by modality fusion that can be better explored in future research.

Train set Test set D R T AV
Scenario 1 Scenario 1 0.8470 0.6965 0.7117 0.8775
Scenario 2 0.2938 0.2955 0.2616 0.3490
Scenario 3 0.1471 0.1355 0.1410 0.1528
Scenario 2 Scenario 1 0.2986 0.1918 0.1844 0.3060
Scenario 2 0.7600 0.5838 0.4987 0.7418
Scenario 3 0.1504 0.1486 0.1243 0.2049
Scenario 3 Scenario 1 0.2309 0.1479 0.1571 0.2767
Scenario 2 0.2032 0.1229 0.1063 0.2182
Scenario 3 0.6736 0.2240 0.3013 0.5708
All scenarios All scenarios 0.7702 0.6335 0.6393 0.8412
Table 1: Test accuracy for teacher models. D: DualCamNet. R: ResNet-50 [18]. T: Temporal ResNet-50 [11]. AV: AVNet.

Student Networks. Before proceeding to the distillation results, we first look at the performance of the two proposed student networks when trained only from the hard labels. Tables 2 and 3 (G column) show the accuracy results for OurSoundNet and HearNet, respectively. It can be observed that both networks perform well, with HearNet achieving a higher accuracy when trained and tested on the same scenario and all scenarios.

The former result is impressive considering that OurSoundNet was trained on the Flickr-SoundNet dataset and fine-tuned from the original SoundNet-5 network up to conv4. HearNet instead was trained from scratch on our dataset. A reasonable explanation for this is that shallow networks such as HearNet perform better under small data regimes.

Train set Test set G D R
Scenario 1 Scenario 1 0.4881 0.6071 0.5238
Scenario 2 0.4114 0.4669 0.4378
Scenario 3 0.1958 0.2844 0.1958
Scenario 2 Scenario 1 0.4339 0.3598 0.4220
Scenario 2 0.3333 0.3810 0.2619
Scenario 3 0.1931 0.1799 0.1786
Scenario 3 Scenario 1 0.3796 0.4352 0.3955
Scenario 2 0.2513 0.3386 0.2725
Scenario 3 0.3690 0.3452 0.2619
All scenarios All scenarios 0.4102 0.5299 0.4145
Table 2: Test accuracy for OurSoundNet trained with distinct supervisory information. G: Ground truth hard labels. D: DualCamNet soft labels. R: ResNet-50 soft labels.
Train set Test set G D R
Scenario 1 Scenario 1 0.6548 0.7857 0.7262
Scenario 2 0.4286 0.4325 0.4960
Scenario 3 0.1627 0.1825 0.2989
Scenario 2 Scenario 1 0.4100 0.5542 0.5106
Scenario 2 0.3214 0.2619 0.4524
Scenario 3 0.1627 0.1825 0.1799
Scenario 3 Scenario 1 0.3307 0.3770 0.4405
Scenario 2 0.2976 0.3056 0.2765
Scenario 3 0.5000 0.6190 0.6071
All scenarios All scenarios 0.6966 0.7009 0.6282
Table 3: Test accuracy for HearNet [6] trained with distinct supervisory information. G: Ground truth hard labels. D: DualCamNet soft labels. R: ResNet-50 soft labels.

Teacher-Student Networks. Finally, we turn to the student networks trained with both the hard labels and the soft labels obtained from the teacher networks. These results are also shown in Tables 2 and 3, columns D and R.

Overall, distillation from DualCamNet yields a larger improvement than distillation from ResNet-50. This is consistent with their respective performances when trained from the hard labels. In some exceptional cases, the teacher is able to help the student even though it does not itself achieve good accuracy. For instance, ResNet-50 trained on scenario 3 reaches a 22.4% test accuracy and HearNet in the same setting reaches 50.0%, but when distilling from ResNet-50 to HearNet the accuracy rises to 60.71%.

We also observe that whenever we train using data from scenario 1, we obtain good generalization. When transferring from DualCamNet, we attribute this improvement to the fact that data acquired in the anechoic room is cleaner than in the other scenarios, as there is less ambient noise. Similarly, when transferring from ResNet-50, there is also less clutter in the scene, allowing the network to more easily capture the objects involved in the action execution. This is, however, not true for scenario 3 and particularly scenario 2, which are considerably noisier (acoustically) and more cluttered (visually).

We validated the chosen hyper-parameters, selecting the best temperature and imitation parameter on the validation set. With the chosen values, we keep the teacher predictions unchanged and give them the same importance as the hard labels. Interestingly, our finding about the imitation parameter is consistent with that of [11].
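The distillation objective used for the students can be sketched as follows. This is a hedged, illustrative implementation of the standard knowledge distillation loss [19], not the paper's exact code: function names and default values are assumptions. Note how a temperature of 1 leaves the teacher predictions unchanged and an imitation parameter of 0.5 weighs the hard and soft terms equally, matching the description above.

```python
import numpy as np

def softmax(z, temperature=1.0):
    # numerically stable softmax with optional temperature scaling
    z = np.asarray(z, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=1.0, imitation=0.5):
    # cross-entropy against the one-hot ground-truth label
    hard_ce = -np.log(softmax(student_logits)[hard_label])
    # cross-entropy against temperature-softened teacher predictions
    soft_targets = softmax(teacher_logits, temperature)
    soft_ce = -np.sum(soft_targets * np.log(softmax(student_logits, temperature)))
    # the imitation parameter balances the two supervision signals
    return (1.0 - imitation) * hard_ce + imitation * soft_ce
```

With `imitation=0.0` the loss reduces to ordinary supervised training from the hard labels, which is the G column setting in Tables 2 and 3.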

In summary, these results show that knowledge distillation is beneficial in all within-scenario settings and, especially, in cross-scenario settings.

6.2 Transfer learning

Finally, we tested the effectiveness of the networks trained on our dataset by using them as feature extractors on a standard sound classification benchmark, the DCASE2018 dataset [29]. Specifically, we performed k-NN classification [16] on the features extracted from DCASE2018 by our student models, to verify whether the learned representations were general enough to perform well in a different audio domain. Table 4 reports our findings.

For SoundNet, we employ features from both the penultimate layer (fc1) and the preceding one (conv4). As for fc1, we note that OurSoundNet, distilled from acoustic images, outperforms SoundNet-5/conv5 pre-trained on the Flickr-SoundNet dataset by around 8%. This might be caused by the fact that the penultimate layer of the original SoundNet-5 is better at distinguishing other types of classes. For this reason, we also compared features taken from the preceding layer, conv4, for both models. Also in this case we perform better, although the gap reduces to about 2%. This is a nontrivial result, since in many cases fine-tuning an existing model on another dataset causes catastrophic forgetting. Regarding HearNet, the model distilled from DualCamNet achieves good results on DCASE2018, similar to those of OurSoundNet.

Comparing with the baseline result in [29] and with the best classification results obtained on this dataset by [27] and [15], we note that our test accuracies are better by a large margin. We point out that we use our dataset as additional information, which may explain the improved results. The features learned on our dataset therefore transfer well to another dataset, as can be seen for HearNet, which is trained from scratch on our dataset alone, yet outperforms these baselines by roughly 10 to 30 percentage points when extracting features for DCASE2018.

Features Training Dataset Test accuracy
Mesaros et al. [29] DCASE2018 0.597
Liping et al. [27] DCASE2018 0.798
Golubkov et al. [15] DCASE2018 0.801
HearNet/fc1 Ours 0.896
OurSoundNet/fc1 Ours 0.898
OurSoundNet/conv4 Ours 0.906
SoundNet-5/conv5 Flickr-SoundNet 0.821
SoundNet-5/conv4 Flickr-SoundNet 0.884
Table 4: Dataset transfer results for DCASE2018 [29]. Features extracted by the models distilled from DualCamNet presented in Section 5 are fed into a k-NN [16] classifier. The number of nearest neighbours is validated on the validation set.

7 Conclusions

In this work, we investigate whether and how it is possible to transfer knowledge from visual data and spatialized sound, namely, acoustic images, in order to improve audio classification from a single microphone. To this end, we take advantage of a special sensor, DualCam, an acoustic-optical camera that outputs audio-visual data synchronized in time and spatially aligned. Using this sensor, we acquired a novel audio-visually indicated action dataset in 3 different scenarios, from which we aim to extract information useful for audio classification.

The peculiar nature of the generated acoustic images synchronized with optical frames, never studied before, led to the design of deep learning models in the context of the teacher-student paradigm, in order to assess if this information was transferable and indeed useful for single-channel audio classification. We highlight here that the proposed teacher-student framework is the first able to distill from 2D visual data and acoustic images to a model taking as input a 1D modality, namely, audio signals.

In a set of experiments in which we learned from visual data and acoustic images separately, we found that the distilled models are effective in the audio classification task, especially in cross-scenario settings. Future work will further explore the capabilities of this sensor for detection, recognition, self-supervised learning, sound source localization, and cross-modal retrieval.


  • [1] T. Afouras, J. S. Chung, and A. Zisserman. The conversation: Deep audio-visual speech enhancement. 2018.
  • [2] R. Arandjelovic and A. Zisserman. Look, listen and learn. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [3] R. Arandjelovic and A. Zisserman. Objects that sound. In The European Conference on Computer Vision (ECCV), September 2018.
  • [4] R. Arrighi, F. Marini, and D. Burr. Meaningful auditory information enhances perception of visual biological motion. Journal of Vision, 9(4):25–25, 2009.
  • [5] Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations from unlabeled video. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 892–900, USA, 2016. Curran Associates Inc.
  • [6] Y. Aytar, C. Vondrick, and A. Torralba. See, hear, and read: Deep aligned representations. CoRR, abs/1706.00932, 2017.
  • [7] L. Castrejón, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2940–2949, June 2016.
  • [8] V. R. DeSa. Learning classification with unlabeled data. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS’93, pages 112–119, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.
  • [9] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. CoRR, abs/1804.03619, 2018.
  • [10] A. Gabbay, A. Shamir, and S. Peleg. Visual speech enhancement. 2018.
  • [11] N. C. Garcia, P. Morerio, and V. Murino. Modality distillation with multiple stream networks for action recognition. In The European Conference on Computer Vision (ECCV), September 2018.
  • [12] W. W. Gaver. What in the world do we hear?: An ecological approach to auditory event perception. Ecological Psychology, 5(1):1–29, 1993.
  • [13] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
  • [14] S. Gharib, K. Drossos, E. Çakir, D. Serdyuk, and T. Virtanen. Unsupervised adversarial domain adaptation for acoustic scene classification. ArXiv e-prints, Aug. 2018.
  • [15] A. Golubkov and A. Lavrentyev. Acoustic scene classification using convolutional neural networks and different channels representations and its fusion. Technical report, DCASE2018 Challenge, September 2018.
  • [16] G. Guo, H. Wang, D. A. Bell, Y. Bi, and K. Greer. Knn model-based approach in classification. In CoopIS/DOA/ODBASE, 2003.
  • [17] D. Harwath, A. Torralba, and J. Glass. Unsupervised learning of spoken language with visual context. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1858–1866. Curran Associates, Inc., 2016.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
  • [19] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. NIPS 2014 Deep Learning Workshop, abs/1503.02531, 2015.
  • [20] J. Hoffman, S. Gupta, and T. Darrell. Learning with side information through modality hallucination. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 826–834, June 2016.
  • [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 448–456, 2015.
  • [22] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, A. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  • [23] E. Kidron, Y. Y. Schechner, and M. Elad. Pixels that sound. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 88–95 vol. 1, June 2005.
  • [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR 2015, abs/1412.6980, 2015.
  • [25] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio. Object Recognition with Gradient-Based Learning, pages 319–345. Springer Berlin Heidelberg, Berlin, Heidelberg, 1999.
  • [26] X. Li, V. Chebiyyam, and K. Kirchhoff. Multi-stream network with temporal attention for environmental sound classification. CoRR, abs/1901.08608, 2019.
  • [27] Y. Liping, C. Xinxing, and T. Lianjie. Acoustic scene classification using multi-scale features. Technical report, DCASE2018 Challenge, September 2018.
  • [28] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. ICLR 2016, abs/1511.03643, 2016.
  • [29] A. Mesaros, T. Heittola, and T. Virtanen. A multi-device dataset for urban acoustic scene classification. ArXiv e-prints, July 2018.
  • [30] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pages 689–696, USA, 2011. Omnipress.
  • [31] A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In The European Conference on Computer Vision (ECCV), September 2018.
  • [32] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2405–2413, 2016.
  • [33] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016, pages 801–816, Cham, 2016. Springer International Publishing.
  • [34] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Learning sight from sound: Ambient sound provides supervision for visual learning. International Journal of Computer Vision, 126(10):1120–1137, Oct 2018.
  • [35] S. Parekh, S. Essid, A. Ozerov, N. Q. K. Duong, P. Perez, and G. Richard. Weakly supervised representation learning for unsynchronized audio-visual events. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
  • [36] K. J. Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, MM ’15, pages 1015–1018, New York, NY, USA, 2015. ACM.
  • [37] M. Ravanelli and Y. Bengio. Speaker recognition from raw waveform with sincnet. ArXiv e-prints, July 2018.
  • [38] A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. So Kweon. Learning to localize sound source in visual scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [39] A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. So Kweon. On learning association of sound source and visual scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
  • [40] N. Takahashi, M. Gygli, and L. V. Gool. Aenet: Learning deep audio features for video analysis. IEEE Transactions on Multimedia, 20(3):513–524, March 2017.
  • [41] H. Terasawa, M. Slaney, and J. Berger. A statistical model of timbre perception. In SAPA@INTERSPEECH, 2006.
  • [42] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu. Audio-visual event localization in unconstrained videos. In The European Conference on Computer Vision (ECCV), September 2018.
  • [43] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
  • [44] H. Van Trees. Detection, Estimation, and Modulation Theory, Optimum Array Processing. Wiley, 2002.
  • [45] M. T. Wallace, M. A. Meredith, and B. E. Stein. Converging influences from visual, auditory, and somatosensory cortices onto output neurons of the superior colliculus. Journal of Neurophysiology, 69(6):1797–1809, 1993. PMID: 8350124.
  • [46] B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski. Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, 27(11):65–71, Nov 1989.
  • [47] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. In The European Conference on Computer Vision (ECCV), September 2018.
  • [48] A. Zunino, M. Crocco, S. Martelli, A. Trucco, A. D. Bue, and V. Murino. Seeing the sound: A new multimodal imaging device for computer vision. In 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pages 693–701, Dec 2015.

Appendix A Data Preparation

We implemented all of our networks and our data processing pipeline using TensorFlow. In particular, we store our dataset in multiple compressed TFRecord files, each of which contains 1 second of synchronized data from the three modalities: video images, raw audio waveforms, and acoustic images. We use the API to retrieve this data and compose variable-length sequences at runtime. We group contiguous TFRecord files into full audio-video sequences and then randomly sample shorter sequences from them, e.g. we compose a full audio-video sequence of 30 seconds and sample from it 10 sequences of 5 seconds.
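The sampling step above can be sketched as follows. This is an illustrative stand-in, not the paper's pipeline: the file names, the helper function, and the fixed seed are assumptions, and a real implementation would read the TFRecord contents rather than just file paths.

```python
import random

def sample_subsequences(record_paths, seq_len=5, n_samples=10, seed=0):
    # record_paths: temporally ordered 1-second TFRecord chunks of a single
    # recording; a fixed seed keeps the sampled crops consistent across runs
    rng = random.Random(seed)
    total = len(record_paths)
    starts = [rng.randrange(0, total - seq_len + 1) for _ in range(n_samples)]
    return [record_paths[s:s + seq_len] for s in starts]

# e.g. a 30-second sequence stored as 30 one-second TFRecord files,
# from which 10 clips of 5 contiguous seconds are sampled
chunks = [f"seq_{i:02d}.tfrecord" for i in range(30)]
clips = sample_subsequences(chunks, seq_len=5, n_samples=10)
```

Because each clip is a contiguous slice of 1-second chunks, the three modalities stay synchronized within every sampled sequence.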

Appendix B Dataset Splitting

In Section 3 of the paper, we mentioned that our dataset consists of 378 audio-video sequences of 30 to 60 seconds each. However, we did not comment on how it was split for training purposes. Since only a few sequences were longer than 30 seconds, and in order to keep a balanced dataset, we cropped all sequences to 30 seconds and assigned 80% of them for training, 10% for validation, and 10% for test.

Splitting the dataset this way yields 302 training sequences, 39 validation sequences, and 37 test sequences. We then extracted sub-sequences of the desired length: when the required length was 1 second we extracted 30 samples, while when it was 5 seconds we extracted 6 samples. Extracting more samples would result in a large amount of repeated data. Finally, to keep consistency across experiments, we used a fixed seed for random crop extraction and the epoch number as seed for data shuffling.
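A deterministic split of this kind might look as follows. This sketch is illustrative only: the seed value and the rounding convention are assumptions, so the resulting validation/test counts need not match the paper's 39/37 exactly.

```python
import random

def split_dataset(sequence_ids, seed=42):
    # 80/10/10 split of the cropped 30-second sequences; a fixed seed
    # makes the shuffle, and hence the split, reproducible
    ids = list(sequence_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_dataset(range(378))
```

Shuffling once with a fixed seed, rather than re-sampling per experiment, ensures that no test sequence ever leaks into training across different runs.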

Appendix C Hyperparameter Optimization

In Section 6 of the paper, we presented the obtained experimental results and mentioned that in some cases we used a different learning rate. We considered only two values. Table 5 shows the values used throughout all the experiments. For all teacher networks we used the smaller learning rate, except for DualCamNet, which required the larger value. For the student networks (OurSoundNet and HearNet) we used a mix of both values, almost always the same across all scenario settings, except for HearNet when trained from DualCamNet soft labels on the first scenario, which required a smaller learning rate.

Network Learning rate
Temporal ResNet-50
HearNet (G)
HearNet (D) and
HearNet (R)
OurSoundNet (G)
OurSoundNet (D)
OurSoundNet (R)
Table 5: Training learning rates. Supervision is indicated as follows: (G): from ground truth hard labels, (D): from DualCamNet soft labels, (R): from ResNet-50 soft labels.

It is worth mentioning that, whenever we trained our student networks with distillation, we performed hyper-parameter optimization using grid search by cross-validation on the held-out validation set. We considered three hyper-parameters: the learning rate, the temperature, and the imitation parameter.

Finally, regarding the transfer learning results, also presented in Section 6 of the paper, we validated the considered number of nearest neighbors. We used odd values between 1 and 9, and found that the best value on the validation set was the same in all cases, except for SoundNet-5/conv5, pre-trained on the Flickr-SoundNet dataset, where a different value proved to be the best.
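The validation procedure above can be sketched with a brute-force k-NN classifier over the extracted feature vectors. This is a minimal illustration, not the paper's code; the Euclidean metric and the function names are assumptions.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    # brute-force k-NN on feature vectors, as extracted by a student model:
    # compute all pairwise Euclidean distances, take the k nearest, and vote
    d = np.linalg.norm(train_x[None, :, :] - query_x[:, None, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]
    votes = train_y[idx]
    return np.array([np.bincount(v).argmax() for v in votes])

def validate_k(train_x, train_y, val_x, val_y, ks=(1, 3, 5, 7, 9)):
    # sweep the odd values of k and keep the most accurate on validation data
    accs = {k: float((knn_predict(train_x, train_y, val_x, k) == val_y).mean())
            for k in ks}
    return max(accs, key=accs.get), accs
```

The chosen k is then frozen and reused when reporting test accuracy, so the test set never influences the hyper-parameter choice.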

Appendix D Dataset Qualitative Analysis

In this section we provide additional qualitative insights on the proposed dataset, which may clarify some statements made in the paper. We first illustrate the problem of visual clutter mentioned in Section 6 of the paper. Figure 6 shows three examples of actions performed over all three scenarios with varying conditions of visual clutter. Comparing scenarios 1 and 3, it can be observed that in the first case the object involved in the action execution is clearly visible in the foreground, making it easier for the visual models to identify the corresponding action. In scenario 2, the difficulty is that other people often appear in the background, or unrelated objects are present in the foreground, making it harder to identify the action.

(a) Stick dropping
(b) Clicking
(c) Plastic crumpling
Figure 6: Comparison of three actions performed on all scenarios. From top to bottom, scenario 1 on the first row, scenario 2 on the second row, and scenario 3 on the third row.

A key finding of the paper was that models based on acoustic data achieved better classification results than models based on visual data. Here we illustrate the difficulty of identifying actions from visual data in contrast to audio data. Figure 7 shows two subjects in the third scenario performing three different actions each. It can be seen that some executions by the same subject look visually similar even though they depict completely different actions, yet they are distinguishable by their acoustic signature.

(a) Speaking
(b) Paper ripping
(c) Paper shaking
(d) Clapping
(e) Snapping fingers
(f) Whistling
Figure 7: Comparison of six actions that are visually similar but distinguishable from audio. All six actions were performed in the third scenario, corresponding to the terrace.

Looking more closely at Figure 7, it can be seen that some actions have a visually distinguishable pattern in the spectrogram. For instance, "clapping" and "snapping fingers" have a periodic pattern and concentrate on the low frequencies rather than the high ones. Such patterns are more difficult to grasp from the raw waveform. This leads us to think that spectrograms are better audio representations, since they summarize the acoustic content of the scene better than the raw waveform. This observation gives more clues as to why HearNet performs better than OurSoundNet in many cases.
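A log-magnitude spectrogram of the kind discussed above can be computed with a short-time Fourier transform. The sketch below is illustrative: the frame size, hop, window, and sample rate are assumptions and are not the actual preprocessing parameters used for HearNet [6].

```python
import numpy as np

def log_spectrogram(wave, frame=512, hop=256, eps=1e-8):
    # windowed frames -> magnitude FFT -> log compression;
    # the result has shape (time frames, frequency bins)
    window = np.hanning(frame)
    frames = [wave[i:i + frame] * window
              for i in range(0, len(wave) - frame + 1, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return np.log(mags + eps)

# 1 second of a 440 Hz tone at a 22050 Hz sample rate (toy input)
wave = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
spec = log_spectrogram(wave)
```

In this representation, a periodic action such as clapping appears as regularly spaced vertical events, and its low-frequency energy concentration is directly visible, which matches the qualitative observations above.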

Figure 8 shows the spectrograms for the same action performed by three different subjects in the third location. It can be seen that the same pattern, multiple events spaced at short time intervals with the energy concentrated in the low frequencies, repeats across the different subjects' executions.

(a) Subject 3
(b) Subject 4
(c) Subject 5
Figure 8: Comparison of the spectrograms for the “knocking” action performed by three distinct subjects on the third scenario.

Figure 9 compares the spectrograms of the audio of three different actions performed by the same subject in the three considered scenarios. Here we also see that the audio for the same action shares a visual pattern when visualized as a spectrogram, even across locations. Interestingly, the cleanest spectrograms are those from actions performed in the first scenario, while in the second and third scenarios there are two different kinds of noise: in the second scenario the noise is mainly due to indoor echoes, while in the third it is due to ambient noise.

(a) Knocking
(b) Speaking
(c) Playing kendama
Figure 9: Comparison of the spectrograms of three actions performed by the same subject at the three considered scenarios. From top to bottom, scenario 1 on the first row, scenario 2 on the second row, and scenario 3 on the third row.

Appendix E Dataset Quantitative Analysis

We report here the confusion matrices for all the student and teacher models, in order to get a deeper understanding of the dataset’s challenges.
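The confusion matrices discussed below are computed in the standard way; the following minimal sketch (class count and toy labels are illustrative) shows the convention used: rows index the ground-truth class and columns the predicted class, so off-diagonal entries such as Hammering predicted as Knocking are directly readable.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=14):
    # rows: ground-truth class, columns: predicted class
    m = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

# toy example with 3 classes: one clip of class 0 misclassified as class 1
cm = confusion_matrix([0, 0, 1, 2], [0, 1, 1, 2], n_classes=3)
```

The trace of the matrix divided by the number of samples recovers the overall accuracy, so a strongly diagonal matrix (as for DualCamNet and AVNet below) corresponds to a high-accuracy model.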

Figure 10: HearNet trained on all scenarios confusion matrix.

For HearNet (Figure 10), we notice that Hammering is often confused with Knocking, Clicking with Typing, and Paper shaking with Plastic crumpling. All three pairs of classes are in fact very similar aurally.

Figure 11: OurSoundNet trained on all scenarios confusion matrix.

Regarding OurSoundNet (Figure 11), many classes are confused with Playing kendama and Stick dropping. Hammering and Knocking, as well as Paper shaking and Stick dropping, are confused with each other; Peanut breaking is always misclassified, probably because of its feeble audio pattern. As stated before, HearNet's superior performance may be ascribed to its more powerful input representation (the spectrogram).

We now consider the teachers' confusion matrices. The DualCamNet (Figure 12) and AVNet (Figure 15) matrices have diagonal elements with very high values, indicating high accuracy (they are indeed good teachers). Temporal ResNet-50 (Figure 14) and ResNet-50 (Figure 13) confuse many classes with Clapping and Clicking, and Whistling is always misclassified. As already indicated by their higher accuracy, we can conclude that DualCamNet and AVNet are better teachers.

Figure 12: DualCamNet trained on all scenarios confusion matrix.
Figure 13: ResNet-50 trained on all scenarios confusion matrix.
Figure 14: Temporal ResNet-50 trained on all scenarios confusion matrix.
Figure 15: AVNet trained on all scenarios confusion matrix.

Finally, we can look in detail at the ResNet-50 confusion matrices when trained and tested on scenario 1 (Figure 16), scenario 2 (Figure 17), and scenario 3 (Figure 18). We notice that when trained and tested on scenario 1, ResNet-50 presents higher accuracies for all classes. In scenario 2 many classes are confused with Clapping, and in scenario 3 with Knocking. In particular, in scenario 1 Snapping fingers, Speaking, and Plastic crumpling are the most difficult to recognize. In scenario 2, Speaking, Snapping fingers, Playing kendama, and Paper shaking have low accuracies. In scenario 3 many classes have low accuracy, e.g. Clapping and Snapping fingers. In fact, these classes are visually similar to others, or the part of the image needed to recognize the action is occluded or cluttered by other objects, so they can be mistaken for one another. This confirms the hypothesis made earlier in Section D about Figure 7.

Figure 16: ResNet-50 trained and tested on scenario 1 confusion matrix.
Figure 17: ResNet-50 trained and tested on scenario 2 confusion matrix.
Figure 18: ResNet-50 trained and tested on scenario 3 confusion matrix.

Appendix F Reproducibility

To enable reproducibility of our results and to motivate further research on deep learning for acoustic images, our code, data, and models will be publicly released.