Environmental Sound Extraction Using Onomatopoeia

12/01/2021
by Yuki Okamoto, et al.

Onomatopoeia, which is a character sequence that phonetically imitates a sound, is effective in expressing characteristics of a sound such as its duration, pitch, and timbre. We propose an environmental-sound-extraction method that uses onomatopoeia to specify the target sound to be extracted. With this method, we estimate a time-frequency mask from an input mixture spectrogram and onomatopoeia using a U-Net architecture and then extract the corresponding target sound by masking the spectrogram. Experimental results indicate that the proposed method extracts only the target sound corresponding to the onomatopoeia and performs better than conventional methods that use sound-event classes to specify the target sound.


1 Introduction

Environmental sounds are essential for expressive media content, e.g., movies, video games, and animation, to make them immersive and realistic. One way to prepare a desired sound is to obtain it from an environmental sound database. However, the number of databases currently available is very limited [10], so the desired sound is not always in a database. On the other hand, there is a large amount of unlabeled environmental sound on the Internet, but it is not easy to use it to expand a database because doing so requires rich domain knowledge and a taxonomy.

Even if a database became large, its usability might decrease because it would also require users to have domain knowledge. Intuitive methods for sound retrieval have therefore been proposed; for example, vocal imitation [21, 23, 13, 22] and onomatopoeia [2] have been used as search queries in sound-retrieval systems. It has also been reported that user satisfaction is high with such intuitive sound-retrieval techniques [21, 23]. It would therefore also be useful for content creators if they could extract a desired sound intuitively.

Figure 1: Overview of environmental sound extraction using onomatopoeia
Figure 2: Detailed architecture of proposed environmental-sound-extraction method using onomatopoeia.

We propose an environmental-sound-extraction method using onomatopoeia, which is a character sequence that phonetically imitates a sound. It has been shown that onomatopoeia is effective in expressing the characteristics of a sound [15, 1], such as its duration, pitch, and timbre. Onomatopoeia is also advantageous in terms of labeling cost, since it does not require domain knowledge or a taxonomy for labeling. Our proposed method thus uses onomatopoeia to specify the sound to extract from a mixture sound, as shown in Fig. 1. We use a U-Net architecture [18], which has been used in various source-separation and sound-extraction studies [4, 5, 11, 14], to estimate the time-frequency mask of the target sound. To the best of our knowledge, there has been no study on extracting only a specific sound by using onomatopoeia.

The rest of the paper is organized as follows. In Sec. 2, we describe related work on environmental sound extraction. In Sec. 3, we present our proposed method for extracting environmental sounds using onomatopoeia from an input mixture. In Sec. 4, we discuss experiments we conducted on the effectiveness of our proposed method compared with baseline methods that use class labels to specify the target sound. Finally, we summarize and conclude this paper in Sec. 5.

2 Related Work

Methods of environmental sound extraction and separation using deep learning have been developed [5, 6, 3, 12]. Sudo et al. developed an environmental-sound-separation method based on the U-Net architecture [5], and similar U-Net-based methods have also been proposed for source separation [4, 19]. Ochiai et al. used Conv-TasNet [16], which was originally proposed for speech separation, to extract only the sounds of specific sound events [6]. These methods use the sound-event class as the input that specifies the desired sound. However, environmental sounds have various characteristics that cannot be described by a sound-event class, such as sound duration, pitch, and timbre. For example, if the "whistle sound" class is defined regardless of pitch, a conventional method cannot extract only the whistle sound of the desired pitch. One possible solution is to define more fine-grained sound-event classes, e.g., "high-pitched whistle sound" and "low-pitched whistle sound." However, this is impractical because the labeling cost would increase. Even if we could define such fine-grained classes, there would still be intra-class variation, and we would have no way to distinguish sounds within a class. Therefore, conditioning on a sound-event class is not well suited to extracting specific sounds.

3 Proposed Method

3.1 Overview of environmental sound extraction using onomatopoeia

Our purpose is to reconstruct a target sound $s$ from a mixture sound $x$, where the target is specified by an onomatopoeia $w$. We estimate $\hat{s}$ from $x$ and $w$ using a non-linear transformation $\mathcal{F}$ as follows:

$\hat{s} = \mathcal{F}(x, w).$   (1)

We explain $\mathcal{F}$ in Sec. 3.2.

3.2 Proposed sound extraction method

Figure 2 shows the detailed architecture of the proposed method. The method involves time-frequency mask estimation using U-Net and feature-vector extraction from an onomatopoeia. We condition the output of the U-Net encoder on the onomatopoeia to specify the target environmental sound to extract. In previous studies, the target sound to be extracted was conditioned by a sound-event class [4] or further conditioned by the estimated interval of the target sound [5]. These studies have shown that conditioning on intermediate features after passing through convolutional neural network layers can be effective. Thus, we also condition on the intermediate features of the U-Net encoder.

The proposed method takes the following as inputs, as shown in Fig. 2. One is a $T$-length, $F$-dimensional mixture spectrogram $\mathbf{X} \in \mathbb{R}^{F \times T}$ extracted from the input mixture sound $x$. The other is a one-hot encoded phoneme sequence $\mathbf{W}$ extracted from $w$. The extracted acoustic feature $\mathbf{X}$ is fed to the U-Net encoder, which consists of stacked convolutional layers. In each layer of the U-Net encoder, the time-frequency dimension decreases by half and the number of channels doubles. As a result, feature maps are calculated as

$(\mathbf{E}_1, \ldots, \mathbf{E}_C) = \mathrm{UNetEnc}(\mathbf{X}),$   (2)

where $\mathbf{E}_c$ denotes the feature map of the $c$-th channel.
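As a rough PyTorch-style sketch (not the authors' exact implementation), such an encoder can be written as a stack of strided convolution blocks; the class name `UNetEncoder`, channel widths, and kernel sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UNetEncoder(nn.Module):
    """Stack of strided convolution blocks: each block halves the
    time-frequency resolution while the channel count grows
    (doubling from block to block after the first)."""

    def __init__(self, in_ch=1, base_ch=16, num_blocks=4):
        super().__init__()
        blocks, ch = [], in_ch
        for i in range(num_blocks):
            out_ch = base_ch * (2 ** i)
            blocks.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2),
            ))
            ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        # x: (batch, 1, F, T) amplitude spectrogram
        skips = []
        for block in self.blocks:
            x = block(x)
            skips.append(x)  # kept for the U-Net skip connections
        return x, skips      # x: (batch, C, reduced F, reduced T)
```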

At the same time, the phoneme sequence $\mathbf{W}$ is fed to a bidirectional long short-term memory (BiLSTM) encoder. As a result, a $D$-dimensional word-level embedding $\mathbf{z} = (z_1, \ldots, z_D)$ that captures the entire onomatopoeia is extracted as follows:

$\mathbf{z} = \mathrm{BiLSTM}(\mathbf{W}).$   (3)

The extracted embedding is stretched in the time and frequency directions to form feature maps $\mathbf{Z}_d = z_d \mathbf{1}$ for $d = 1, \ldots, D$, where $\mathbf{1}$ is the matrix whose elements are all 1.
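A minimal sketch of this conditioning branch, assuming a single BiLSTM layer whose final forward and backward hidden states are concatenated and then tiled over the time-frequency plane; the phoneme-vocabulary size and hidden width are placeholders.

```python
import torch
import torch.nn as nn

class OnomatopoeiaEncoder(nn.Module):
    """BiLSTM over a one-hot phoneme sequence; the last hidden states of the
    forward and backward directions are concatenated into one embedding."""

    def __init__(self, num_phonemes=40, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(num_phonemes, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, w):
        # w: (batch, phoneme_len, num_phonemes) one-hot phoneme sequence
        _, (h, _) = self.bilstm(w)              # h: (2, batch, hidden)
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, D) with D = 2 * hidden

def stretch_embedding(z, freq_bins, time_frames):
    """Tile the D-dimensional embedding over the time-frequency plane so it
    can be concatenated with the encoder feature maps channel-wise."""
    # z: (batch, D) -> (batch, D, freq_bins, time_frames)
    return z[:, :, None, None].expand(-1, -1, freq_bins, time_frames)
```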

Finally, a time-frequency soft mask $\mathbf{M}$ is estimated using the U-Net decoder, which consists of stacked deconvolutional layers. The feature maps from the U-Net encoder and the BiLSTM encoder are concatenated to form $C + D$ channels and fed to the U-Net decoder followed by the element-wise sigmoid function $\sigma(\cdot)$ as

$\mathbf{U} = (\mathbf{E}_1, \ldots, \mathbf{E}_C, \mathbf{Z}_1, \ldots, \mathbf{Z}_D),$   (4)
$\mathbf{M} = \sigma(\mathrm{UNetDec}(\mathbf{U})).$   (5)

The target signal in the time-frequency domain $\hat{\mathbf{S}}$ is then recovered by masking the input as

$\hat{\mathbf{S}} = \mathbf{M} \odot \mathbf{X},$   (6)

where $\odot$ is the Hadamard product.
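The decoding and masking steps could be sketched as follows; skip connections are omitted for brevity, and the block count and channel schedule are assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

class UNetDecoder(nn.Module):
    """Stack of transposed-convolution blocks that mirrors the encoder
    (skip connections omitted for brevity); the last block outputs a
    single-channel map squashed to [0, 1] by a sigmoid."""

    def __init__(self, in_ch, num_blocks=4):
        super().__init__()
        blocks, ch = [], in_ch
        for i in range(num_blocks):
            out_ch = 1 if i == num_blocks - 1 else max(ch // 2, 8)
            blocks.append(nn.ConvTranspose2d(ch, out_ch,
                                             kernel_size=4, stride=2, padding=1))
            ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, enc_out, onoma_maps):
        # Concatenate encoder output and tiled embedding along the channel axis.
        h = torch.cat([enc_out, onoma_maps], dim=1)  # (batch, C + D, F', T')
        for i, block in enumerate(self.blocks):
            h = block(h)
            if i < len(self.blocks) - 1:
                h = torch.relu(h)
        return torch.sigmoid(h)  # time-frequency soft mask M

# Eq. (6): the target spectrogram is the Hadamard (element-wise) product
# of the estimated mask and the mixture spectrogram:
# s_hat = mask * mixture_spec
```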

During training, the loss function is defined as the root mean square error between $\hat{\mathbf{S}}$ and the target features $\mathbf{S}$, which are extracted from the target sound $s$:

$\mathcal{L} = \frac{1}{\sqrt{TF}} \| \mathbf{S} - \hat{\mathbf{S}} \|_F,$   (7)

where $\| \cdot \|_F$ is the Frobenius norm.
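A one-line PyTorch version of this loss, assuming `s_hat` and `s` are the masked and target amplitude spectrograms:

```python
import torch

def rmse_loss(s_hat, s):
    """Root mean square error between the masked and target spectrograms,
    i.e. the Frobenius norm of the difference divided by sqrt(T * F)."""
    return torch.sqrt(torch.mean((s_hat - s) ** 2))
```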

In the inference phase, we reconstruct an environmental sound waveform from the masked acoustic features $\hat{\mathbf{S}}$ using the Griffin–Lim algorithm [9].
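A minimal sketch of this reconstruction step using `librosa.griffinlim`, with the FFT settings from Table 1 and an assumed iteration count:

```python
import librosa

def reconstruct_waveform(masked_amplitude, n_fft=2048, hop_length=512, n_iter=32):
    """Recover a waveform from the masked amplitude spectrogram by estimating
    the missing phase with the Griffin-Lim algorithm."""
    # masked_amplitude: (n_fft // 2 + 1, T) magnitude spectrogram
    return librosa.griffinlim(masked_amplitude, n_iter=n_iter,
                              n_fft=n_fft, hop_length=hop_length)
```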

4 Experiments

4.1 Dataset construction

To construct the datasets for this task, we used environmental sounds from the Real World Computing Partnership Sound Scene Database (RWCP-SSD) [17]. Some sound events in RWCP-SSD are labeled in the "event entry + ID" format, e.g., whistle1 and whistle2. We created hierarchical sound-event classes by grouping labels with the same event entry, e.g., whistle. In this way, we selected 44 sound events from RWCP-SSD, which we call subclasses, and grouped them into 16 superclasses; the superclasses and subclasses used in this study are listed in Table 2. The sounds in each subclass were split 7:2:1 into training, validation, and evaluation sets, respectively. The onomatopoeias corresponding to each environmental sound were taken from RWCP-SSD-Onomatopoeia [8]. Each sound is annotated with more than 15 onomatopoeias in RWCP-SSD-Onomatopoeia, and we randomly selected three of them per sound for our experiments.

We constructed the following three evaluation datasets using the selected sound events:

  • Inter-superclass dataset: each mixture sound is composed of a target sound and interference sounds whose superclasses differ from that of the target sound.

  • Intra-superclass dataset: each mixture sound is composed of a target sound and interference sounds whose superclass is the same as that of the target sound but whose subclasses differ.

  • Intra-subclass dataset: each mixture sound is composed of a target sound and interference sounds whose subclass is the same as that of the target sound but whose onomatopoeias differ.

The mixture sounds in each dataset were created by varying the signal-to-noise ratio (SNR) between the target and the interference sounds. The SNR between a target signal $s(t)$ and an interference signal $n(t)$ is defined as

$\mathrm{SNR} = 10 \log_{10} \left( \sum_t s(t)^2 \big/ \sum_t n(t)^2 \right).$   (8)
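A simple NumPy sketch of how a mixture can be generated at a requested SNR under this definition (the exact mixing script used by the authors is not specified here):

```python
import numpy as np

def mix_at_snr(target, interference, snr_db):
    """Scale the interference so that the target-to-interference power ratio
    equals the requested SNR (in dB), then sum the two signals.
    Both signals are assumed to be 1-D arrays of the same length."""
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interference ** 2)
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    return target + gain * interference
```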

The training and validation sets consisted of 7,563 and 2,160 mixture sounds, respectively. Each evaluation set consisted of 1,107 mixture sounds for each SNR. The audio clips for these sets were randomly selected from RWCP-SSD.

Mixture-sound length:
Sampling rate:
Waveform encoding: 16-bit linear PCM
# of U-Net encoder blocks: 4
# of U-Net decoder blocks: 4
# of BiLSTM encoders: 1
# of units in BiLSTM encoder: 512
Batch size: 8
Optimizer: RAdam [7]
Acoustic feature: Amplitude spectrogram
Window length for FFT: 2,048 samples
Window shift for FFT: 512 samples
Table 1: Experimental conditions
Superclass: subclasses
metal: metal05, metal10, metal15
bells: bells1, bells2, bells3, bells4, bells5
dice: dice1, dice2, dice3
coin: coin1, coin2, coin3
coins: coins1, coins2, coins3, coins4, coins5
bottle: bottle1, bottle2
cup: cup1, cup2
particl: particl1, particl2
whistle: whistle1, whistle2, whistle3
cap: cap1, cap2
clap: clap1, clap2
claps: claps1, claps2
clip: clip1, clip2
phone: phone1, phone2, phone3, phone4
toy: toy1, toy2
bell: bell1, bell2
Table 2: Superclass and subclass sound events used in this study
Table 3: SDRi [dB] for extracted signals. The table reports the superclass-conditioned, subclass-conditioned, and onomatopoeia-conditioned methods on the inter-superclass, intra-superclass, and intra-subclass datasets under each SNR condition.
Figure 3: Examples of environmental sound extraction using intra-subclass dataset. Mixture spectrogram (first row), results of subclass-conditioned sound extraction (second row), results of onomatopoeia-conditioned sound extraction (proposed) (third row), and ground truth spectrogram (fourth row).

4.2 Training and evaluation setup

Table 1 shows the experimental conditions and parameters used for the proposed method (onomatopoeia-conditioned method). As baselines, we also evaluated methods in which the target sound is conditioned on the superclass or subclass sound-event class; for these, we used the one-hot representation of the class label in place of the embedding $\mathbf{z}$ in (3).

To evaluate each method, we used signal-to-distortion ratio improvement (SDRi) [20] as the evaluation metric. SDRi is defined as the difference between the SDR of the extracted sound and that of the input mixture, both computed with respect to the target sound:

$\mathrm{SDRi} = \mathrm{SDR}(\hat{s}, s) - \mathrm{SDR}(x, s).$   (9)

We evaluated SDRi on each of the three evaluation datasets introduced in Sec. 4.1.
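As a sketch, SDRi can be computed as follows; note that this uses a simplified SDR rather than the full BSS Eval definition of [20]:

```python
import numpy as np

def sdr(estimate, reference):
    """Simple signal-to-distortion ratio in dB; the BSS Eval SDR [20]
    additionally allows a short distortion filter on the reference."""
    noise = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))

def sdri(extracted, mixture, target):
    """SDR improvement (Eq. (9)): SDR of the extracted signal minus SDR of
    the unprocessed mixture, both measured against the target signal."""
    return sdr(extracted, target) - sdr(mixture, target)
```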

4.3 Experimental results

Table 3 shows the SDRi on each evaluation dataset. We observed that the superclass-conditioned method performed well on the inter-superclass dataset but poorly on the intra-superclass and intra-subclass datasets. We also observed that the subclass-conditioned method performed well on the inter-superclass and intra-superclass datasets but poorly on the intra-subclass dataset. These results indicate that the performance of sound extraction conditioned on an event class is highly dependent on the fineness of the class definition. The onomatopoeia-conditioned method showed almost the same SDRi on all three datasets. This suggests that onomatopoeia can behave like a more fine-grained class than the subclasses, even though it does not require any special domain knowledge for labeling.

Figure 3 shows the spectrograms of the sounds extracted with the subclass-conditioned and onomatopoeia-conditioned methods. For this visualization, we used five samples from the intra-subclass dataset. We observed that the subclass-conditioned method left a significant amount of non-target sound, whereas the onomatopoeia-conditioned method extracted only the target sound. Although the onomatopoeia-conditioned method performed better than the superclass- and subclass-conditioned methods, it still does not perform well when the target sound heavily overlaps with the interference sounds (cf. "Subclass: Phone4" in Fig. 3). The extraction of overlapping sounds requires further study. The extracted sounds are available on our web page: https://y-okamoto1221.github.io/Sound_Extraction_Onomatopoeia/.

5 Conclusion

We proposed an environmental-sound-extraction method using onomatopoeia. The proposed method estimates a time-frequency mask of the target sound specified by onomatopoeia with the U-Net encoder-decoder architecture. The experimental results indicate that our proposed method extracts specific sounds from mixture sounds by using onomatopoeia as a condition. Our proposed method outperformed conventional methods that use a sound-event class as a condition. The results indicate that onomatopoeia can behave like a more fine-grained class than sound-event classes, even though it does not require any special domain knowledge for labeling. In the future, it will be necessary to verify the effectiveness of the proposed method for onomatopoeia assigned by speakers of different languages.

References

  • [1] S. Sundaram and S. Narayanan (2006) Vector-based representation and clustering of audio using onomatopoeia words. In Proc. American Association for Artificial Intelligence (AAAI) Symposium Series, pp. 55–58.
  • [2] S. Ikawa and K. Kashino (2018) Acoustic event search with an onomatopoeic query: measuring distance between onomatopoeic words and sounds. In Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 59–63.
  • [3] J. H. Lee, H.-S. Choi, and K. Lee (2019) Audio query-based music source separation. In Proc. International Society for Music Information Retrieval (ISMIR), pp. 878–885.
  • [4] G. M.-Brocal and G. Peeters (2019) Conditioned-U-Net: introducing a control mechanism in the U-Net for multiple source separations. In Proc. International Society for Music Information Retrieval (ISMIR), pp. 159–165.
  • [5] Y. Sudo, K. Itoyama, K. Nishida, and K. Nakadai (2019) Environmental sound segmentation utilizing Mask U-Net. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5340–5345.
  • [6] T. Ochiai, M. Delcroix, Y. Koizumi, H. Ito, K. Kinoshita, and S. Araki (2020) Listen to what you want: neural network-based universal sound selector. In Proc. INTERSPEECH, pp. 1441–1445.
  • [7] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2020) On the variance of the adaptive learning rate and beyond. In Proc. International Conference on Learning Representations (ICLR), pp. 1–13.
  • [8] Y. Okamoto, K. Imoto, S. Takamichi, R. Yamanishi, T. Fukumori, and Y. Yamashita (2020) RWCP-SSD-Onomatopoeia: onomatopoeic words dataset for environmental sound synthesis. In Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 125–129.
  • [9] D. Griffin and J. Lim (1984) Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2), pp. 236–243.
  • [10] K. Imoto (2018) Introduction to acoustic event and scene analysis. Acoust. Sci. Tech. 39 (3), pp. 182–188.
  • [11] A. Jansson, E. J. Humphrey, N. Montecchio, R. M. Bittner, A. Kumar, and T. Weyde (2017) Singing voice separation with deep U-Net convolutional networks. In Proc. International Society for Music Information Retrieval (ISMIR), pp. 745–751.
  • [12] I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. L. Roux, and J. R. Hershey (2019) Universal sound separation. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 175–179.
  • [13] B. Kim and B. Pardo (2019) Improving content-based audio retrieval by vocal imitation feedback. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4100–4104.
  • [14] Q. Kong, Y. Wang, X. Song, Y. Cao, W. Wang, and M. D. Plumbley (2020) Source separation with weakly labelled data: an approach to computational auditory scene analysis. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 101–105.
  • [15] G. Lemaitre and D. Rocchesso (2014) On the effectiveness of vocal imitations and verbal descriptions of sounds. The Journal of the Acoustical Society of America 135 (2), pp. 862–873.
  • [16] Y. Luo and N. Mesgarani (2019) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8), pp. 1256–1266.
  • [17] S. Nakamura, K. Hiyane, F. Asano, and T. Endo (1999) Sound scene data collection in real acoustical environments. The Journal of the Acoustical Society of Japan (E) 20 (3), pp. 225–231.
  • [18] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241.
  • [19] O. Slizovskaia, L. Kim, G. Haro, and E. Gomez (2019) End-to-end sound source separation conditioned on instrument labels. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 306–310.
  • [20] E. Vincent, R. Gribonval, and C. Févotte (2006) Performance measurement in blind audio source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 14 (4), pp. 1462–1469.
  • [21] Y. Zhang, J. Hu, Y. Zhang, B. Pardo, and Z. Duan (2020) Vroom!: a search engine for sounds by vocal imitation queries. In Proc. Conference on Human Information Interaction and Retrieval (CHIIR), pp. 23–32.
  • [22] Y. Zhang, B. Pardo, and Z. Duan (2019) Siamese style convolutional neural networks for sound search by vocal imitation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (2), pp. 429–441.
  • [23] Y. Zhang and Z. Duan (2016) IMISOUND: an unsupervised system for sound query by vocal imitation. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2269–2273.