Understanding the surrounding environment through sounds has been considered a major component of many daily applications, such as surveillance, smart cities, and smart cars [1, 2]. Recent years have witnessed great progress in sound event detection and classification using deep learning techniques [3, 4, 5, 6, 7]. However, most prior art relies on standard supervised learning algorithms and may not perform well for sound events with sparse training examples. While such a few-shot learning scenario has been studied in computer vision [8], little work, if any, has been done for few-shot sound event recognition, to the best of our knowledge.
The Matching Network [8] uses the cosine similarity to measure the distance between a learned representation of the labeled support examples and a given unlabeled example for classification. Importantly, it proposes an episodic procedure during training, which samples only a few examples of each class as data points to simulate the few-shot learning scenario. Such a procedure brings the training phase close to the test phase of few-shot learning and accordingly improves the model's generalization ability. Snell et al. [11] follow the episodic procedure and propose the Prototypical Network, which takes the average of the learned representations of the few examples of each class as a class-wise representation, and then classifies an unlabeled input by computing the Euclidean distance between the input and the class-wise representations. Sung et al. [10] propose the Relation Network to learn a non-linear distance metric for measuring the distance between the unlabeled input and the few examples of each class.
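To make the contrast among these metric-based methods concrete, the Prototypical Network's classification rule can be sketched in a few lines of numpy (a minimal sketch on pre-computed embeddings; the function and variable names are ours):

```python
import numpy as np

def prototypical_predict(support, query):
    """Classify a query embedding against class prototypes.

    support: dict mapping class label -> (K, D) array of the K support
             embeddings of that class.
    query:   (D,) array, the embedding of the query example.
    """
    labels = sorted(support)
    # Prototype = mean of the K support embeddings of each class.
    prototypes = np.stack([support[c].mean(axis=0) for c in labels])
    # Nearest prototype under squared Euclidean distance wins.
    dists = ((prototypes - query) ** 2).sum(axis=1)
    return labels[int(np.argmin(dists))]
```

The Matching Network instead compares the query with every individual support example via cosine similarity, and the Relation Network replaces the fixed distance with a learned non-linear one.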
Most sound event recognition models are trained on datasets with clip-level labels, such as DCASE2016 [1], ESC-50 [12], and AudioSet [13]. Because a clip-level label does not specify where the corresponding event actually takes place in an audio signal, this training strategy may make a model overlook short or transient sound events [14]. To tackle this issue for few-shot sound recognition, in this paper, we propose a novel attentional similarity module that automatically guides the model to pay attention to specific segments of a long audio clip for recognizing relatively short or transient sound events. We show that our attentional similarity module can be learned relying on only clip-level annotation, and that it can be plugged into any existing method to improve its performance for few-shot sound recognition.
2.1 Few-shot sound recognition
The goal of few-shot sound recognition is to learn a classifier that can quickly adapt to unseen classes with only a few examples. In training, we are given a training set $\mathcal{T} = \{\mathcal{S}_1, \ldots, \mathcal{S}_C\}$, where $\mathcal{S}_c$ is a support set, $c$ denotes the class of the support set, and $C$ is the total number of classes in the training set. Generally, an episode consists of the support examples $\mathcal{S}_c = \{x_{c,1}, \ldots, x_{c,K}\}$ for each class and a query example $\hat{x}$ (see Fig. 2), where $x$ is the input feature and $K$ is the number of support examples per class. The support examples are randomly sampled from each of the $C$ classes in the training set, and the query example is randomly chosen from the remaining examples of those classes. The task is called $C$-way $K$-shot learning, and $K$ is often a small number from 1 to 5.
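The episodic sampling procedure described above can be sketched as follows (a minimal sketch; the function name and data layout are illustrative, not from the paper):

```python
import random

def sample_episode(dataset, C=5, K=5):
    """Sample one C-way K-shot episode from {class label: list of examples}.

    Returns K support examples for each of C randomly chosen classes, plus
    one query example drawn from the remaining examples of those classes.
    """
    classes = random.sample(sorted(dataset), C)
    support, leftovers = {}, {}
    for c in classes:
        shuffled = random.sample(dataset[c], len(dataset[c]))  # shuffled copy
        support[c] = shuffled[:K]
        leftovers[c] = shuffled[K:]
    query_class = random.choice(classes)
    query = random.choice(leftovers[query_class])
    return support, query, query_class
```

Training on many such episodes makes the training condition match the $C$-way $K$-shot test condition.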
In our work, we use the simple yet powerful ConvNet architecture as our feature learning model $f_{\theta}$. The model can be learned by minimizing the following objective function:

$$\theta^{*} = \arg\min_{\theta} \; \mathcal{L}(\theta) + \mathcal{R}(\theta),$$

where $\mathcal{L}$ is a loss function, $\theta$ denotes the parameters of the network, and $\mathcal{R}$ is a regularization term for avoiding overfitting.
For a query $\hat{x}$ with ground-truth class $c$, the loss takes the standard softmax cross-entropy form over similarity scores,

$$\mathcal{L}(\theta) = -\log \frac{\exp\big(\mathrm{sim}(F(\hat{x}), F(\mathcal{S}_{c}))\big)}{\sum_{c'=1}^{C} \exp\big(\mathrm{sim}(F(\hat{x}), F(\mathcal{S}_{c'}))\big)},$$

where $F(\cdot)$ is the output (a.k.a. a feature map) from the last convolutional layer of $f_{\theta}$, $\mathcal{S}_{c}$ denotes the set of inputs labeled with class $c$, and $\mathrm{sim}(\cdot, \cdot)$ is the similarity function for measuring the distance between two inputs.
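A common way to train such a model is to apply a softmax cross-entropy loss over the per-class similarity scores of an episode; a minimal numpy sketch (our notation):

```python
import numpy as np

def episode_loss(sims, target):
    """Softmax cross-entropy over per-class similarity scores.

    sims:   (C,) array; sims[c] is the similarity between the query and
            the support set of class c.
    target: index of the query's ground-truth class.
    """
    # Numerically stable log-softmax over the similarity scores.
    z = sims - sims.max()
    log_prob = z - np.log(np.exp(z).sum())
    return -log_prob[target]
```

Minimizing this loss pushes the similarity to the correct class up while suppressing the similarities to the other classes in the episode.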
Table 1: Few-shot classification accuracy on ESC-50, with and without the proposed attentional similarity.

| Model | Attentional | Depth | Param. | 5-way 1-shot | 5-way 5-shot | 10-way 1-shot | 10-way 5-shot |
|---|---|---|---|---|---|---|---|
| Siamese Network [9] | | 3 | 1.18M | 43.5% | 50.9% | 26.1% | 31.6% |
| Matching Network [8] | | 3 | 1.18M | 53.7% | 67.0% | 34.5% | 47.9% |
| Relation Network [10] | | 7 | 4.78M | 60.0% | 70.3% | 41.7% | 52.0% |
| Similarity Embedding Network [15] | | 8 | 1.61M | 61.0% | 78.1% | 45.2% | 65.7% |
| Prototypical Network [11] | | 3 | 1.18M | 67.9% | 83.0% | 46.2% | 74.2% |
| Siamese Network [9] | ✓ | 3+1 | 2.50M | 49.3% | 58.6% | 29.0% | 39.0% |
| Matching Network [8] | ✓ | 3+1 | 2.50M | 59.0% | 74.0% | 38.8% | 55.3% |
| Relation Network [10] | ✓ | 7+1 | 6.11M | 64.0% | 74.4% | 46.0% | 57.0% |
| Similarity Embedding Network [15] | ✓ | 8+1 | 3.40M | 71.2% | 82.0% | 56.9% | 71.0% |
| Prototypical Network [11] | ✓ | 3+1 | 2.50M | 74.0% | 87.7% | 55.0% | 76.5% |
Table 2: Few-shot classification accuracy on noiseESC-50, with and without the proposed attentional similarity.

| Model | Attentional | Depth | Param. | 5-way 1-shot | 5-way 5-shot | 10-way 1-shot | 10-way 5-shot |
|---|---|---|---|---|---|---|---|
| Siamese Network [9] | | 3 | 1.18M | 38.2% | 43.5% | 25.0% | 28.0% |
| Matching Network [8] | | 3 | 1.18M | 51.0% | 61.5% | 31.7% | 43.0% |
| Relation Network [10] | | 7 | 4.78M | 56.2% | 74.5% | 39.2% | 52.5% |
| Similarity Embedding Network [15] | | 8 | 1.61M | 63.2% | 78.5% | 44.2% | 62.0% |
| Prototypical Network [11] | | 3 | 1.18M | 66.2% | 83.0% | 46.5% | 72.2% |
| Siamese Network [9] | ✓ | 3+1 | 2.50M | 46.0% | 50.0% | 29.0% | 29.7% |
| Matching Network [8] | ✓ | 3+1 | 2.50M | 52.7% | 66.5% | 36.2% | 48.2% |
| Relation Network [10] | ✓ | 7+1 | 6.11M | 61.0% | 76.2% | 40.0% | 59.2% |
| Similarity Embedding Network [15] | ✓ | 8+1 | 3.40M | 70.2% | 83.2% | 49.2% | 67.2% |
| Prototypical Network [11] | ✓ | 3+1 | 2.50M | 69.7% | 85.7% | 51.5% | 73.5% |
2.2 Attentional similarity
First-order similarity: A conventional approach applies a pooling function over the ConvNet feature map $F(x) \in \mathbb{R}^{C \times T}$ to yield a fixed-length vector, where $C$ is the number of channels and $T$ is the size of the temporal dimension. The similarity function can be written as:

$$\mathrm{sim}(x_1, x_2) = D\big(G(F(x_1)), G(F(x_2))\big),$$

where $G(\cdot)$ is the pooling function and $D(\cdot, \cdot)$ is any distance function between two vectors, such as the inner product.
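This pooling-based similarity can be sketched as follows (a minimal numpy sketch; average pooling and the inner product are illustrative choices for $G$ and $D$):

```python
import numpy as np

def pooled_similarity(F1, F2):
    """First-order similarity between two (C, T) feature maps:
    average-pool each map over time into a C-vector, then take the
    inner product of the two clip-level vectors."""
    g1 = F1.mean(axis=1)  # G(F(x1)): pooling over the temporal axis
    g2 = F2.mean(axis=1)  # G(F(x2))
    return float(g1 @ g2)  # D(., .): inner product
```

Because the pooling discards the temporal axis, a short event occupying a few frames contributes very little to the clip-level vector.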
Second-order similarity: A recent work on second-order similarity estimation [15] computes the segment-by-segment (second-order) similarity between two inputs using the feature maps of the last layer of the ConvNet before pooling. Compared with a fixed-length vector (clip-level feature), this method allows the model to use segment-level features to learn the temporal correlation between two inputs. The second-order similarity can be written as follows:

$$\mathrm{sim}(x_1, x_2) = \sum_{i=1}^{T} \sum_{j=1}^{T} F(x_1)_{:,i}^{\top} F(x_2)_{:,j} = \mathbf{1}^{\top} F(x_1)^{\top} F(x_2)\, \mathbf{1}. \tag{4}$$
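In code, the second-order similarity is simply the sum of inner products between every pair of time segments of the two feature maps, which collapses into a bilinear form; a minimal numpy sketch:

```python
import numpy as np

def second_order_similarity(F1, F2):
    """Sum of inner products between every pair of time segments of the
    two (C, T) feature maps."""
    total = 0.0
    for i in range(F1.shape[1]):
        for j in range(F2.shape[1]):
            total += float(F1[:, i] @ F2[:, j])
    return total

def second_order_similarity_fast(F1, F2):
    """The same double sum collapsed into a bilinear form: 1^T F1^T F2 1."""
    return float(np.ones(F1.shape[1]) @ F1.T @ F2 @ np.ones(F2.shape[1]))
```

Every segment pair contributes equally here, which is the limitation the attentional weighting below addresses.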
Inspired by this method, we propose to learn a weight matrix $W \in \mathbb{R}^{T \times T}$ to generate an attentional second-order similarity that captures the importance of each segment-by-segment similarity. We can rewrite Eq. (4) as:

$$\mathrm{sim}(x_1, x_2) = \sum_{i=1}^{T} \sum_{j=1}^{T} W_{ij}\, F(x_1)_{:,i}^{\top} F(x_2)_{:,j} = \mathrm{tr}\big(W^{\top} F(x_1)^{\top} F(x_2)\big).$$
Following [18], we can compute the attentional similarity by applying a rank-1 approximation to the weight matrix, $W \approx a_1 a_2^{\top}$, where $a_1, a_2 \in \mathbb{R}^{T}$. Then, we can derive the following attentional similarity function:

$$\mathrm{sim}(x_1, x_2) = \mathrm{tr}\big(a_2 a_1^{\top} F(x_1)^{\top} F(x_2)\big). \tag{8}$$
where $\mathrm{tr}(\cdot)$ is the trace operator and $a_i$ is the attention vector computed by feeding $F(x_i)$ to another stack of convolutional layers to find the importance of the segments. Eq. (8) uses the attention vectors $a_1$ and $a_2$ to compute a weighted average of the segment-by-segment similarities. More importantly, Eq. (8) can be rewritten as follows:

$$\mathrm{sim}(x_1, x_2) = a_1^{\top} F(x_1)^{\top} F(x_2)\, a_2 = \big(F(x_1) a_1\big)^{\top} \big(F(x_2) a_2\big).$$
The final equation can be interpreted as computing the similarity score as the inner product between the two attentional vectors $F(x_1) a_1$ and $F(x_2) a_2$. This allows us to replace the inner product with common distance functions (e.g., cosine similarity or Euclidean distance) to measure the distance between the two attentional vectors. In general, the attentional similarity can therefore be computed as $D\big(F(x_1) a_1, F(x_2) a_2\big)$.
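The equivalence between the trace form and the inner product of the two attention-weighted clip vectors can be verified numerically (a minimal numpy sketch; the attention vectors are taken as given here rather than produced by the attention branch):

```python
import numpy as np

def attentional_similarity(F1, a1, F2, a2):
    """Inner product of the attention-weighted clip vectors F1 a1 and
    F2 a2 (F: (C, T) feature map, a: (T,) attention vector)."""
    return float((F1 @ a1) @ (F2 @ a2))

def attentional_similarity_trace(F1, a1, F2, a2):
    """Equivalent trace form with the rank-1 weight W = a1 a2^T:
    sim = tr(W^T F1^T F2) = a1^T F1^T F2 a2."""
    W = np.outer(a1, a2)
    return float(np.trace(W.T @ F1.T @ F2))
```

The inner-product form is the one used in practice, since each input is reduced to a single attention-weighted vector that any standard distance function can consume.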
3.1 Experimental Settings
Datasets: We conduct experiments on few-shot sound recognition using two datasets: ESC-50 and noiseESC-50. The ESC-50 dataset [12] contains 2,000 5-second audio clips labeled with 50 classes, each having 40 examples. The sound categories cover animal, natural, human, and ambient sounds. To evaluate the models under background noise conditions, following prior work, we create the second dataset, coined noiseESC-50, by augmenting the ESC-50 audio clips with additive background noise randomly selected from audio recordings of 15 different acoustic scenes in the DCASE2016 dataset [1]. Such a synthetic strategy generates artificially noisy audio examples that reflect sound recordings in everyday environments. Therefore, evaluation on noiseESC-50 may better measure how the models would perform in real-world applications.
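The noise augmentation can be sketched as additive mixing at a controlled level (a minimal numpy sketch; scaling the noise to a target signal-to-noise ratio, and the 10 dB default, are our illustrative assumptions rather than the exact noiseESC-50 recipe):

```python
import numpy as np

def mix_noise(clip, noise, snr_db=10.0):
    """Additively mix a background-noise recording into an audio clip,
    scaling the noise so the clip-to-noise power ratio equals snr_db.
    (The exact mixing level used for noiseESC-50 is not specified here;
    10 dB is only an illustrative default.)"""
    noise = noise[: len(clip)]
    p_clip = np.mean(clip ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clip / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clip + scale * noise
```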
We note that, although the size and vocabulary of ESC-50 are small and limited, it is appropriate as a public benchmark dataset for few-shot sound recognition. Larger benchmark datasets such as AudioSet [13] may suffer from the openness issue of audio data [19] and from class imbalance problems [20].
Data preparation: To directly compare our model against strong baselines for few-shot learning, our experiments use similar splits to prior work. The 50 sound event classes of the ESC-50 dataset are divided into 35 classes for training, 5 for validation, and 10 for testing. We train our model and the baselines on the 35 training classes and use the 5 validation classes for selecting the final model.
Feature extraction: To speed up model training, all audio clips from the ESC-50 and DCASE2016 datasets are downsampled from 44.1 kHz to 16 kHz. We extract a 128-bin log mel-spectrogram from the raw audio as the input feature to the neural networks. The librosa library [21] is used for audio processing and feature extraction.
Network design: The backbone network is based on a simple yet powerful CNN structure that has been widely used in audio tasks [16, 17]. The input to the network is a mel-spectrogram with 128 frequency bins and 160 frames. Our backbone network consists of a stack of blocks, each of which has a convolutional layer followed by batch normalization [22], a ReLU activation layer, and a max-pooling layer. For optimization, we use stochastic gradient descent (SGD) with an initial learning rate of 0.01. The learning rate is divided by 10 every 20 epochs for annealing, and we set the maximal number of epochs to 60. Moreover, we set the weight decay to 1e-4 to avoid overfitting.
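The step-decay annealing described above (SGD learning rate divided by 10 every 20 epochs) can be sketched as (a minimal sketch; the function name is ours):

```python
def learning_rate(epoch, base_lr=0.01, decay_every=20, factor=0.1):
    """Step-decay annealing: divide the learning rate by 10 every
    20 epochs, starting from 0.01."""
    return base_lr * factor ** (epoch // decay_every)
```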
Please note that all the baselines and our model use the same backbone network. In our pilot studies, we also explored using other advanced CNN structures such as ResNet [23] or VGG [24] as the backbone network but saw little improvement, possibly due to the moderate size of the ESC-50 dataset. For reproducibility, we will release the source code through a GitHub repo.
3.2 Experimental Results
Table 1 compares the performance of our own implementation of five metric-based learning methods for few-shot sound recognition with and without the proposed attentional similarity module on ESC-50. We can see that using attentional similarity clearly improves all the existing methods, giving rise to +4% to +7.1% relative improvement in 5-way 5-shot learning, a large performance gain. According to our experiment, the prototypical network turns out to be more effective and efficient than the relation network and the similarity embedding network for few-shot sound recognition. This seems to support similar findings in computer vision tasks [11, 10].
Table 2 shows the experimental results on noiseESC-50. Again, we see that the attentional similarity module consistently improves the results of the existing methods, with relative performance gains ranging from +2.1% to +6.5% in 5-way 5-shot learning. Compared with ESC-50, the performance gain decreases slightly, possibly because sound event recognition on noiseESC-50 is more challenging. Finally, as on ESC-50, the prototypical network empowered with attentional similarity achieves the best result among the evaluated methods by a clear margin in 5-shot learning.
Figure 3 gives a qualitative comparison of the result of the prototypical network with and without attentional similarity for 3 query examples from the noiseESC-50 test set. Those picked by the model without attentional similarity (i.e., marked in cyan) do not share the same class as the queries; they are picked possibly because both the query and the picked one have a long silence. In contrast, the model with attentional similarity finds correct matches (marked in red).
We have introduced a simple attentional similarity module for few-shot sound recognition that generates an attentional representation of the inputs. It allows the model to ignore unrelated background noise while matching relatively short sound events. Extensive experiments show that attentional similarity consistently improves the performance of various existing methods on datasets of both noise-free and noisy clips. In the future, we plan to extend the model to a multi-label learning setting for few-shot sound recognition.
[1] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, “TUT database for acoustic scene classification and sound event detection,” in Proc. EUSIPCO, 2016, pp. 1128–1132.
[2] Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen, “DCASE 2017 Challenge setup: Tasks, datasets and baseline system,” in Proc. DCASE, 2017.
[3] Jen-Yu Liu and Yi-Hsuan Yang, “Event localization in music auto-tagging,” in Proc. ACM MM, 2016.
[4] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron Weiss, and Kevin Wilson, “CNN architectures for large-scale audio classification,” in Proc. ICASSP, 2017.
[5] Toan Vu, An Dang, and Jia-Ching Wang, “Deep learning for DCASE2017 Challenge,” in Proc. DCASE, 2017.
[6] Donmoon Lee, Subin Lee, Yoonchang Han, and Kyogu Lee, “Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input,” in Proc. DCASE, 2017.
[7] Ting-Wei Su, Jen-Yu Liu, and Yi-Hsuan Yang, “Weakly-supervised audio event detection using event-specific Gaussian filters and fully convolutional networks,” in Proc. ICASSP, 2017.
[8] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra, “Matching networks for one shot learning,” in Proc. NIPS, 2016, pp. 3630–3638.
[9] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in Proc. ICML Deep Learning Workshop, 2015.
[10] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proc. CVPR, 2018.
[11] Jake Snell, Kevin Swersky, and Richard Zemel, “Prototypical networks for few-shot learning,” in Proc. NIPS, 2017.
[12] Karol J. Piczak, “ESC: Dataset for environmental sound classification,” in Proc. ACM MM, 2015, pp. 1015–1018.
[13] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. ICASSP, 2017.
[14] Szu-Yu Chou, Jyh-Shing Roger Jang, and Yi-Hsuan Yang, “Learning to recognize transient sound events using attentional supervision,” in Proc. IJCAI, 2018, pp. 3336–3342.
[15] Yu-Siang Huang, Szu-Yu Chou, and Yi-Hsuan Yang, “Generating music medleys via playing music puzzle games,” in Proc. AAAI, 2018.
[16] Aaron van den Oord, Sander Dieleman, and Benjamin Schrauwen, “Deep content-based music recommendation,” in Proc. NIPS, 2013.
[17] Anurag Kumar, Maksim Khadkevich, and Christian Fügen, “Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes,” in Proc. ICASSP, 2018.
[18] Rohit Girdhar and Deva Ramanan, “Attentional pooling for action recognition,” in Proc. NIPS, 2017.
[19] Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andrés Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A platform for the creation of open audio datasets,” in Proc. ISMIR, 2017, pp. 486–493.
[20] C. Huang, Y. Li, C. Change Loy, and X. Tang, “Learning deep representation for imbalanced classification,” in Proc. CVPR, 2016.
[21] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto, “librosa: Audio and music signal analysis in Python,” in Proc. SciPy, 2015.
[22] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML, 2015.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
[24] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015.