Environmental sounds are essential for making expressive media content, e.g., movies, video games, and animation, immersive and realistic. One way to prepare a desired sound is to obtain it from an environmental sound database. However, the number of databases currently available is very limited, so the desired sound is not always in the database. On the other hand, there is a large amount of unlabeled environmental sound on the Internet, but it is not easy to use it to expand a database because doing so requires rich domain knowledge and a taxonomy.
Even if the database became large, its usability might decrease because searching it would also require users to have domain knowledge. Intuitive methods for sound retrieval have therefore been proposed. For example, vocal imitation [21, 23, 13, 22] and onomatopoeia were used as search queries in some sound-retrieval systems. It has also been reported that user satisfaction is high with such intuitive sound-retrieval techniques [21, 23]. It would thus also be useful for content creators if they could extract a desired sound intuitively.
We propose an environmental-sound-extraction method using onomatopoeia, i.e., a character sequence that phonetically imitates a sound. It has been shown that onomatopoeia is effective in expressing the characteristics of a sound [15, 1], such as its duration, pitch, and timbre. Onomatopoeia is also advantageous in terms of labeling cost since it requires neither domain knowledge nor a taxonomy. Our proposed method thus uses onomatopoeia to specify the sound to extract from a mixture, as shown in Fig. 1. We use a U-Net architecture, which has been used in various source-separation and sound-extraction studies [4, 5, 11, 14], to estimate the time-frequency mask of the target sound. To the best of our knowledge, there has been no study on extracting only a specific sound by using onomatopoeia.
The rest of the paper is organized as follows. In Sec. 2, we describe related work on environmental sound extraction. In Sec. 3, we present our proposed method for extracting environmental sounds using onomatopoeia from an input mixture. In Sec. 4, we discuss experiments we conducted on the effectiveness of our proposed method compared with baseline methods that use class labels to specify the target sound. Finally, we summarize and conclude this paper in Sec. 5.
2 Related Work
Methods of environmental sound extraction and separation using deep learning have been developed [5, 6, 3, 12]. Sudo et al. developed an environmental-sound-separation method based on the U-Net architecture. Similar U-Net-based methods have also been proposed for source separation [4, 19]. Ochiai et al. used Conv-TasNet, which was originally proposed for speech separation, to extract only the sounds of specific sound events. These methods use the sound-event class as the input that specifies the desired sound. However, environmental sounds have various characteristics that cannot be described by a sound class, such as duration, pitch, and timbre. For example, if the “whistle sound” class is defined regardless of pitch, a conventional method cannot extract only the sound of the desired pitch. One possible solution is to define more fine-grained sound-event classes, e.g., “high-pitched whistle sound” and “low-pitched whistle sound.” However, this is impractical because the labeling cost would increase. Moreover, even with such fine-grained classes, there would always be intra-class variation, and there would be no way to distinguish sounds within a class. Therefore, conditioning on a sound-event class is not suitable for extracting specific sounds.
3 Proposed Method
3.1 Overview of environmental sound extraction using onomatopoeia
3.2 Proposed sound extraction method
Fig. 2 shows the detailed architecture of the proposed method. The method involves time-frequency mask estimation using U-Net and feature-vector extraction from an onomatopoeia. We condition the output of the U-Net encoder on the onomatopoeia to specify the target environmental sound to extract. In previous studies, the target sound to be extracted was conditioned by its sound-event class, or further conditioned by the estimated interval of the target sound. These studies have shown that conditioning on intermediate features after passing through convolutional neural network layers can be effective. Thus, we also condition on the intermediate features of the U-Net encoder.
The proposed method takes the following as inputs, as shown in Fig. 2. One is a $T$-length, $F$-dimensional mixture spectrogram $\mathbf{X} \in \mathbb{R}^{T \times F}$ extracted from the input mixture sound. The other is a one-hot encoded phoneme sequence $\mathbf{w} = (w_1, \dots, w_N)$ extracted from the onomatopoeia. The acoustic feature $\mathbf{X}$ is fed to the U-Net encoder, which consists of stacked convolutional layers. In each layer of the U-Net encoder, the time-frequency dimensions decrease by half and the number of channels doubles. As a result, feature maps $\mathbf{E} = (\mathbf{E}_1, \dots, \mathbf{E}_C)$ are calculated as

$$\mathbf{E} = \mathrm{UNetEncoder}(\mathbf{X}),$$

where $\mathbf{E}_c \in \mathbb{R}^{T' \times F'}$ denotes the feature map of the $c$-th channel.
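To make the shape bookkeeping concrete, the following sketch traces how the feature-map shapes evolve through the encoder described above, where each block halves the time-frequency dimensions and doubles the channel count. The initial channel count (16) and input size (256 frames, 1,024 bins) are illustrative assumptions, not values from the paper.

```python
def encoder_shapes(t, f, base_channels, num_blocks):
    """Return the (channels, time, freq) shape after each encoder block."""
    shapes = []
    c = base_channels
    for _ in range(num_blocks):
        t, f = t // 2, f // 2   # time-frequency dimensions halve per block
        shapes.append((c, t, f))
        c *= 2                   # channel count doubles for the next block
    return shapes

# With 4 encoder blocks (as in Table 1) and the assumed input size:
print(encoder_shapes(256, 1024, 16, 4))
# → [(16, 128, 512), (32, 64, 256), (64, 32, 128), (128, 16, 64)]
```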
At the same time, the phoneme sequence $\mathbf{w}$ is fed to a bidirectional long short-term memory (BiLSTM) encoder. As a result, a $D$-dimensional word-level embedding $\mathbf{z} \in \mathbb{R}^{D}$ that captures the entire onomatopoeia is extracted as follows:

$$\mathbf{z} = \mathrm{BiLSTM}(\mathbf{w}).$$
The extracted embedding $\mathbf{z}$ is stretched in the time and frequency directions to form feature maps $\mathbf{Z} = (z_1 \mathbf{1}_{T' \times F'}, \dots, z_D \mathbf{1}_{T' \times F'})$, where $\mathbf{1}_{T' \times F'}$ is the $T' \times F'$ matrix the elements of which are all $1$.
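The stretching step can be sketched in NumPy as follows (toy sizes, not the authors' code): each embedding element is broadcast over an all-ones matrix so that the result can be concatenated channel-wise with the encoder output.

```python
import numpy as np

# Illustrative only: stretch a D-dimensional embedding into D constant
# feature maps of size T' x F'. The values of D, T', F', and z are toy
# assumptions for demonstration.
D, Tp, Fp = 3, 4, 5
z = np.array([0.1, -0.5, 2.0])           # word-level embedding from the BiLSTM (assumed)
ones = np.ones((Tp, Fp))
Z = np.stack([z_d * ones for z_d in z])  # shape (D, T', F')

assert Z.shape == (D, Tp, Fp)
assert np.all(Z[1] == -0.5)              # each map is constant over time and frequency
```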
Finally, a time-frequency soft mask $\mathbf{M} \in [0, 1]^{T \times F}$ is estimated using the U-Net decoder, which consists of stacked deconvolutional layers. The feature maps from the U-Net encoder and the BiLSTM encoder are concatenated to form $C + D$ channels and fed to the U-Net decoder, followed by the element-wise sigmoid function $\sigma(\cdot)$, as

$$\mathbf{M} = \sigma\bigl(\mathrm{UNetDecoder}([\mathbf{E}, \mathbf{Z}])\bigr).$$
The target signal in the time-frequency domain $\hat{\mathbf{S}}$ is then recovered by masking the input as

$$\hat{\mathbf{S}} = \mathbf{M} \odot \mathbf{X},$$

where $\odot$ is the Hadamard product.
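A minimal NumPy illustration of the masking step (toy values only): the soft mask, with entries in (0, 1), is applied to the mixture spectrogram element-wise, so it can only attenuate each time-frequency bin.

```python
import numpy as np

# Illustrative sketch of applying a time-frequency soft mask.
rng = np.random.default_rng(0)
T, F = 4, 6
X = rng.random((T, F))                               # mixture amplitude spectrogram
M = 1 / (1 + np.exp(-rng.standard_normal((T, F))))   # sigmoid output in (0, 1)
S_hat = M * X                                        # Hadamard product: target estimate

assert S_hat.shape == X.shape
assert np.all(S_hat <= X)   # a soft mask can only attenuate amplitudes
```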
During training, the loss function is the root mean square error between $\hat{\mathbf{S}}$ and the target features $\mathbf{S}$, which are extracted from the target sound:

$$\mathcal{L} = \frac{1}{\sqrt{TF}} \bigl\| \mathbf{S} - \hat{\mathbf{S}} \bigr\|_{F},$$

where $\| \cdot \|_{F}$ is the Frobenius norm.
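The loss above, written as the Frobenius norm of the error normalized by the number of time-frequency bins, can be sketched as:

```python
import numpy as np

# RMSE between target and estimated spectrograms via the Frobenius norm.
# Shapes and values are illustrative.
def rmse_loss(S, S_hat):
    T, F = S.shape
    return np.linalg.norm(S - S_hat, ord="fro") / np.sqrt(T * F)

S = np.ones((2, 3))
S_hat = np.zeros((2, 3))
print(rmse_loss(S, S_hat))  # → 1.0: every bin is off by exactly 1
```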
In the inference phase, we reconstruct the environmental sound waveform from the masked acoustic features using the Griffin–Lim algorithm.
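As a rough illustration (not the authors' implementation), a standard Griffin–Lim loop alternates between the inverse STFT and the STFT, keeping the known magnitudes and updating only the phase. The window settings here are toy values assumed for brevity; the paper's Table 1 uses a 2,048-sample window with a 512-sample shift.

```python
import numpy as np
from scipy.signal import stft, istft

# Minimal Griffin-Lim sketch: recover a waveform from an amplitude
# spectrogram by iterative phase refinement. Assumes SciPy is available.
def griffin_lim(magnitude, n_fft=512, hop=128, n_iter=32, seed=0):
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        _, x = istft(magnitude * phase, nperseg=n_fft, noverlap=n_fft - hop)
        _, _, Z = stft(x, nperseg=n_fft, noverlap=n_fft - hop)
        Z = Z[:, : magnitude.shape[1]]           # guard against frame-count drift
        if Z.shape[1] < magnitude.shape[1]:
            Z = np.pad(Z, ((0, 0), (0, magnitude.shape[1] - Z.shape[1])))
        phase = np.exp(1j * np.angle(Z))
    _, x = istft(magnitude * phase, nperseg=n_fft, noverlap=n_fft - hop)
    return x

# Round trip on a toy sine: take its magnitude spectrogram and resynthesize.
t = np.arange(8192) / 16000
sine = np.sin(2 * np.pi * 440 * t)
_, _, Z = stft(sine, nperseg=512, noverlap=384)
y = griffin_lim(np.abs(Z))
```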
4 Experiments

4.1 Dataset construction
To construct the datasets for this task, we used environmental sounds from the Real World Computing Partnership Sound Scene Database (RWCP-SSD). Some sound events in RWCP-SSD are labeled in an “event entry + ID” format, e.g., whistle1 and whistle2. We created hierarchical sound-event classes by grouping labels with the same event entry, e.g., whistle. We selected 44 sound events from RWCP-SSD, which we call subclasses, and grouped them into 16 superclasses. The superclasses and subclasses used in this study are listed in Table 2. The sounds in each subclass were divided 7:2:1 into training, validation, and evaluation sets, respectively. The onomatopoeias corresponding to each environmental sound were taken from RWCP-SSD-Onomatopoeia, in which each sound is annotated with more than 15 onomatopoeias; we randomly selected three onomatopoeias per sound for our experiments.
We constructed the following three evaluation datasets using the selected sound events:
Inter-superclass dataset: each mixture sound in this dataset is composed of a target sound and interference sounds whose superclasses differ from that of the target sound.
Intra-superclass dataset: each mixture sound in this dataset is composed of a target sound and interference sounds whose superclass is the same as that of the target sound but whose subclasses differ.
Intra-subclass dataset: each mixture sound in this dataset is composed of a target sound and interference sounds whose subclass is the same as that of the target sound but whose onomatopoeias differ.
The mixture sounds in each dataset were created with varying signal-to-noise ratios (SNRs). The SNR between a target signal $s(t)$ and an interference signal $n(t)$ is defined as

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{t} s(t)^2}{\sum_{t} n(t)^2}.$$
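Mixture creation at a desired SNR amounts to scaling the interference before adding it to the target, following the definition above. A sketch with toy noise signals (not the authors' code):

```python
import numpy as np

# Scale the interference n so that the mixture s + scale*n has the
# requested SNR (in dB) relative to the target s.
def mix_at_snr(s, n, snr_db):
    power_s = np.sum(s ** 2)
    power_n = np.sum(n ** 2)
    scale = np.sqrt(power_s / (power_n * 10 ** (snr_db / 10)))
    return s + scale * n

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)   # toy target
n = rng.standard_normal(16000)   # toy interference
x = mix_at_snr(s, n, 10.0)       # interference is 10 dB below the target
```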
The training and validation sets consisted of 7,563 and 2,160 mixture sounds, respectively. Each evaluation set consisted of 1,107 mixture sounds for each SNR. The audio clips for these sets were randomly selected from RWCP-SSD.
Table 1: Experimental conditions.

| Waveform encoding | 16-bit linear PCM |
| # of U-Net encoder blocks | 4 |
| # of U-Net decoder blocks | 4 |
| # of BiLSTM encoders | 1 |
| # of units in BiLSTM encoder | 512 |
| Acoustic feature | Amplitude spectrogram |
| Window length for FFT | 2,048 samples |
| Window shift for FFT | 512 samples |
Table 2: Superclasses and subclasses used in our experiments.

| Superclass | Subclasses |
| metal | metal05, metal10, … |
| dice | dice1, dice2, dice3 |
| bottle | bottle1, bottle2 |
| cup | cup1, cup2 |
| particl | particl1, particl2 |
| clap | clap1, clap2 |
| claps | claps1, claps2 |
| clip | clip1, clip2 |
| bells | bells1, bells2, bells3, … |
| coin | coin1, coin2, coin3 |
| coins | coins1, coins2, coins3, coins4, coins5 |
| whistle | whistle1, whistle2, … |
| phone | phone1, phone2, phone3, phone4 |
| toy | toy1, toy2 |
[Table 3: SDRi on the inter-superclass, intra-superclass, and intra-subclass datasets for each conditioning method; numeric values not recoverable from the extraction.]
4.2 Training and evaluation setup
Table 1 shows the experimental conditions and parameters used for the proposed method (onomatopoeia-conditioned method). As baselines, we also evaluated methods in which the target sound is conditioned on the superclass or subclass sound-event label. For these baselines, we used the one-hot representation of the label in place of the word-level embedding in (3).
To evaluate each method, we used the signal-to-distortion ratio improvement (SDRi) as the evaluation metric. SDRi is defined as the difference between the SDR of the extracted sound and that of the input mixture, both computed with respect to the target sound:

$$\mathrm{SDRi} = \mathrm{SDR}_{\text{extracted}} - \mathrm{SDR}_{\text{mixture}}.$$
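The metric can be sketched with a simple energy-ratio SDR (a simplification of the BSS-Eval definition cited in the paper); toy signals are used for illustration.

```python
import numpy as np

# SDR of an estimate against the target, and the improvement (SDRi)
# of the extracted signal over the unprocessed mixture.
def sdr(target, estimate):
    return 10 * np.log10(np.sum(target ** 2) / np.sum((target - estimate) ** 2))

def sdri(target, mixture, extracted):
    return sdr(target, extracted) - sdr(target, mixture)

rng = np.random.default_rng(1)
s = rng.standard_normal(8000)
noise = 0.5 * rng.standard_normal(8000)
mixture = s + noise
extracted = s + 0.1 * noise    # a better estimate than the raw mixture
print(round(sdri(s, mixture, extracted), 1))  # → 20.0 (residual noise is 10x smaller)
```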
We evaluated SDRi on each of the three evaluation datasets introduced in Sec. 4.1.
4.3 Experimental results
Table 3 shows the SDRi on each evaluation dataset. We observed that the superclass-conditioned method performed well on the inter-superclass dataset but poorly on the intra-superclass and intra-subclass datasets. We also observed that the subclass-conditioned method performed well on the inter-superclass and intra-superclass datasets but not on the intra-subclass dataset. These results indicate that the performance of sound extraction conditioned on an event class is highly dependent on the fineness of the class definition. The onomatopoeia-conditioned method showed almost the same SDRi on all three datasets. This suggests that onomatopoeia can behave like a class more fine-grained than the subclasses, even though it requires no special domain knowledge for labeling.
Figure 3 shows the spectrograms of the sounds extracted using the subclass-conditioned and onomatopoeia-conditioned methods. For this visualization, we used five samples from the intra-subclass dataset. We observed that the subclass-conditioned method left a significant amount of non-target sound, while the onomatopoeia-conditioned method extracted only the target sound. Although the onomatopoeia-conditioned method performed better than the superclass- and subclass-conditioned methods, it still does not perform well when the target sound heavily overlaps with interference sounds (cf. “Subclass: Phone4” in Fig. 3). The extraction of overlapping sounds requires further study. The extracted sounds are available on our web page: https://y-okamoto1221.github.io/Sound_Extraction_Onomatopoeia/.
5 Conclusion

We proposed an environmental-sound-extraction method using onomatopoeia. The proposed method estimates a time-frequency mask of the target sound specified by an onomatopoeia with a U-Net encoder-decoder architecture. The experimental results indicate that our method extracts specific sounds from mixture sounds by using onomatopoeia as a condition, outperforming conventional methods that use a sound-event class. The results also indicate that onomatopoeia can behave like a class more fine-grained than sound-event classes, even though it requires no special domain knowledge for labeling. In future work, we will verify the effectiveness of the proposed method for onomatopoeias assigned by speakers of different languages.
References
- Vector-based representation and clustering of audio using onomatopoeia words. In Proc. AAAI Symposium Series, pp. 55–58.
- (2018) Acoustic event search with an onomatopoeic query: measuring distance between onomatopoeic words and sounds. In Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 59–63.
- (2019) Audio query-based music source separation. In Proc. International Society for Music Information Retrieval (ISMIR), pp. 878–885.
- (2019) Conditioned-U-Net: introducing a control mechanism in the U-Net for multiple source separations. In Proc. International Society for Music Information Retrieval (ISMIR), pp. 159–165.
- (2019) Environmental sound segmentation utilizing Mask U-Net. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5340–5345.
- (2020) Listen to what you want: neural network-based universal sound selector. In Proc. INTERSPEECH, pp. 1441–1445.
- (2020) On the variance of the adaptive learning rate and beyond. In Proc. International Conference on Learning Representations (ICLR), pp. 1–13.
- (2020) RWCP-SSD-Onomatopoeia: onomatopoeic words dataset for environmental sound synthesis. In Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 125–129.
- (1984) Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2), pp. 236–243.
- (2018) Introduction to acoustic event and scene analysis. Acoust. Sci. Tech. 39(3), pp. 182–188.
- (2017) Singing voice separation with deep U-Net convolutional networks. In Proc. International Society for Music Information Retrieval (ISMIR), pp. 745–751.
- (2019) Universal sound separation. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 175–179.
- (2019) Improving content-based audio retrieval by vocal imitation feedback. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4100–4104.
- (2020) Source separation with weakly labelled data: an approach to computational auditory scene analysis. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 101–105.
- (2014) On the effectiveness of vocal imitations and verbal descriptions of sounds. The Journal of the Acoustical Society of America 135(2), pp. 862–873.
- (2019) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27(8), pp. 1256–1266.
- (1999) Sound scene data collection in real acoustical environments. The Journal of the Acoustical Society of Japan (E) 20(3), pp. 225–231.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241.
- (2019) End-to-end sound source separation conditioned on instrument labels. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 306–310.
- (2006) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 14(4), pp. 1462–1469.
- (2020) Vroom!: a search engine for sounds by vocal imitation queries. In Proc. Conference on Human Information Interaction and Retrieval (CHIIR), pp. 23–32.
- (2019) Siamese style convolutional neural networks for sound search by vocal imitation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27(2), pp. 429–441.
- (2016) IMISOUND: an unsupervised system for sound query by vocal imitation. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2269–2273.