Deep Long Audio Inpainting

11/15/2019, by Ya-Liang Chang, et al.

Long (> 200 ms) audio inpainting, recovering a long missing part of an audio segment, could be widely applied to audio editing and transmission-loss recovery. It is a very challenging problem because audio features are high-dimensional, complex, and weakly correlated. While deep learning models have made tremendous progress in image and video inpainting, audio inpainting has not attracted much attention. In this work, we take a pioneering step and explore the possibility of adapting deep learning frameworks from various domains, including audio synthesis and image inpainting, to audio inpainting. As the first to systematically analyze the factors affecting audio inpainting performance, we also explore how mask size, receptive field, and audio representation affect the results. We further set up a benchmark for long audio inpainting. The code will be available on GitHub upon acceptance.


Introduction

Audio inpainting, filling in audio gaps of different scales, is of significant importance to a broad range of applications.

Gaps of several to hundreds of milliseconds often occur during transmission, where packets are subject to frequent loss due to unreliable communication channels. A large body of research has been dedicated to packet loss during transmission and has had success tackling gaps at the scale of milliseconds. For small rates of lost data, sparsity-based [Adler et al.2011, Siedenburg, Dörfler, and Kowalski2013], sinusoidal-based [Lagrange, Marchand, and Rault2005], and autoregressive [Oudre2018] methods have been proposed. For situations with high packet loss rates in speech, [Bahat, Schechner, and Elad2015] proposed an example-based method that exploits prior information from the same user to fill in the gaps.

Figure 1: Illustration of the long audio inpainting problem. Given a sound clip with part of it masked out (> 200 ms), the goal is to recover the masked part. Audio inpainting could be done on either the raw waveform (left) or the spectrogram (right). Long audio inpainting could be widely used for sound editing tasks such as swear word removal, music editing, etc.

Larger gaps spanning seconds could occur in various applications and cases, such as music enhancement and restoration. [Perraudin et al.2018] identifies the rather unrealistic assumption often made in shorter-range inpainting that the signal is stationary, which tends not to hold for longer gaps. They harness a similarity graph to obtain similarities between segments and enable second-scale gap filling by substituting the most suitable segment for the gap.

Though these methods have shown considerable success at multiple gap scales, to the best of our knowledge, none is tailored to audio editing, where a user could mask out an unwanted segment of an audio clip and expect the restoration to sound natural and meaningful (in the case of speech). While audio editing could serve a broad range of applications, such as removing environmental noise from a speech recording or removing human sounds from bird recordings, we show that current algorithms targeting second-scale gaps, such as [Perraudin et al.2018], fail when applied to this scenario (cf. Table 2). (We do not compare with methods for packet loss since the scale difference is too large.)

Long (> 200 ms) audio inpainting for editing is a very challenging task. Firstly, gaps are commonly at the scale of seconds, rendering algorithms for shorter gaps ineffective. Secondly, in the case of speech, signals are mostly aperiodic, invalidating example-based methods such as [Perraudin et al.2018]. Thirdly, the data are high-dimensional (16k samples per second) and the correlation between neighboring samples is rather low, so directly applying state-of-the-art image or video inpainting models tends not to work well.

Also, while image inpainting has been extensively explored and promising deep-learning results have been demonstrated for large masks, only a few papers [Marafioti et al.2018] have experimented with deep learning for long audio inpainting, let alone discussed how different factors of a neural network could affect the inpainting performance.

Hence, in this work, we take a pioneering step toward long audio inpainting for editing purposes and beyond. As the first to explore the problem, we survey and experiment with models from various domains such as image inpainting, Deep Image Prior [Ulyanov, Vedaldi, and Lempitsky2018], and audio synthesis [Prenger, Valle, and Catanzaro2019]. We also propose two novel frameworks for unconstrained audio inpainting, and systematically probe how and to what extent various factors such as gap size, audio representation (waveform or spectrogram), receptive field, and convolution type (dilated and gated convolution) impact the inpainting performance. Finally, we set up a benchmark for audio inpainting evaluation and hope it will facilitate future research in this domain.

Our contributions could be summarized as follows:

  • We set up a benchmark for long audio inpainting and compare different baselines, based on the SC09 dataset of human voices and the ESC-50 dataset of natural sounds.

  • We survey and evaluate the possibility of adapting models from different domains for audio inpainting.

  • We design novel waveform-based and spectrogram-based models for long audio inpainting.

  • We experiment with different components for deep long audio inpainting, including kernel sizes and model depths.

Related Work

Image and video inpainting.

Inpainting models aim to restore masked areas in an image/video, which could be widely used in image/video editing, such as object removal [Criminisi, Perez, and Toyama2003, Chang, Yu Liu, and Hsu2019]. The masked areas are usually given, either by human annotation or by segmentation models, and could be a bounding box [Wang et al.2019, Yu et al.2018b], an object [Huang et al.2016], or an arbitrary shape [Yu et al.2018a, Liu et al.2018, Chang, Liu, and Hsu2019]. Many approaches have been proposed to address the inpainting problem, such as diffusion-based ones [Bertalmio et al.2000, Bertalmio, Bertozzi, and Sapiro2001] and patch-based ones [Barnes et al.2009, Huang et al.2016]. In recent years, deep learning methods have become the dominant approach for image inpainting [Yu et al.2018a, Nazeri et al.2019] and video inpainting [Kim et al.2019, Chang et al.2019] due to their ability to recover unseen parts of an image based on the data distribution learned during training. As a baseline, we fine-tune a state-of-the-art image inpainting model [Wang et al.2018] to recover missing parts of spectrograms for audio inpainting.

Apart from trained deep image inpainting frameworks, Deep Image Prior [Ulyanov, Vedaldi, and Lempitsky2018] offers a way to utilize the underlying structure of an untrained network for image restoration and demonstrates promising results. We also consider it as one of our baselines.

Audio inpainting.

is to fill gaps in audio, a problem that has been extensively explored under different terminologies [Smaragdis, Raj, and Shashanka2009, Wolfe and Godsill2005]. Much of this work [Marafioti et al.2018, Bahat, Schechner, and Elad2015] is dedicated to packet loss in VoIP, clicks, and impulsive noises; in these works, gaps are at the scale of several to tens of milliseconds.

For gaps ranging from hundreds of milliseconds to several seconds, [Bahat, Schechner, and Elad2015] utilizes the statistics of recordings from the same user to perform inpainting, and [Perraudin et al.2018] proposes to use a graph to capture spectral similarity between different segments of the signal, where the most suitable one is used for inpainting. Nevertheless, [Perraudin et al.2018] is only practical for signals with repeated patterns (e.g., music) and tends to fail on aperiodic signals like speech (cf. Table 2), while [Bahat, Schechner, and Elad2015] can only handle speech from the same speaker. We still set [Perraudin et al.2018] as one of the baselines, since [Bahat, Schechner, and Elad2015] is not suitable for datasets like ESC-50 [Piczak], whose audios are all natural sounds.

While there is also work on speech inpainting [Prablanc et al.2016], we do not compare with it as it requires text to perform inpainting.

Audio synthesis.

is to generate audio either unconditionally or based on given cues. A pioneering work is WaveNet [Oord et al.2016], which achieves longer-range dependency with receptive fields enlarged through dilated convolution. Yet, one drawback of directly generating audio samples through auto-regressive structures is low speed.

Hence, many works have since built upon it to improve generation speed. Staying within an auto-regressive structure, WaveRNN [Kalchbrenner et al.2018] substitutes an RNN for the stack of convolutions in [Oord et al.2016]. Another prevalent approach is to generate an intermediate spectrogram before converting it to the final audio [Prenger, Valle, and Catanzaro2019, Donahue, McAuley, and Puckette2018].

Figure 2: Overall model architecture. (a) The waveform inpainting model takes the masked raw waveform and the mask as input and directly generates the inpainted waveform. (b) The spectrogram inpainting model first transforms the masked waveform into a spectrogram, inpaints it as an image, and then transforms it back to a waveform with the Griffin-Lim algorithm. The L1 loss is calculated between (a3) and (c1), while perceptual losses are derived using pre-trained models (SoundNet/ResNet50) to extract features for the waveform (a3, c1) / spectrogram (b3, c2). Note that we experiment with different designs of (a2/b2) in the ablation study.

Though the goal of audio synthesis differs from that of audio inpainting, both generate audio conditionally. Hence, we consider WaveGlow [Prenger, Valle, and Catanzaro2019] as one of the baselines and train the vocoder to generate the inpainted audio given a masked spectrogram instead of a complete one.

Proposed Method

Definition

For audio inpainting, we take an audio sequence Y_in together with a mask M as input, where the masked samples are set to zero. The model recovers the masked samples and generates the output audio Y_out, and the goal is to minimize the loss between Y_out and the ground truth Y_gt.
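To make the setup concrete, the following is a minimal sketch of how a masked input could be constructed, assuming 16 kHz mono clips and the fixed 0.4-0.6 s SC09 test mask described later; the convention that the mask is 1 inside the missing region is our own choice for illustration.

```python
import numpy as np

def make_masked_input(y_gt, sr=16000, gap_start=0.4, gap_end=0.6):
    """Zero out a gap of y_gt and return (y_in, mask).

    mask is 1.0 inside the missing region and 0.0 elsewhere
    (an illustrative convention; the paper does not spell one out).
    """
    mask = np.zeros_like(y_gt)
    mask[int(gap_start * sr):int(gap_end * sr)] = 1.0
    y_in = y_gt * (1.0 - mask)  # masked samples are set to zero
    return y_in, mask
```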

Spectrogram Inpainting Model

For our spectrogram inpainting models, we first transform each masked waveform Y_in into a spectrogram S by the short-time Fourier transform (STFT) with window width w, treat it as a special single-channel image, and thereby cast audio inpainting as a special image inpainting problem of recovering the missing parts of the spectrogram. The recovered spectrogram is then transformed back to a waveform Y_out by the Griffin-Lim algorithm [Griffin and Lim1984] for comparison.
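A minimal sketch of this round trip is shown below, assuming librosa for the STFT and Griffin-Lim steps; the FFT size and hop length are illustrative placeholders rather than the paper's exact settings, and `inpaint_fn` stands for any trained spectrogram inpainting network.

```python
import numpy as np
import librosa

N_FFT, HOP = 1024, 256  # illustrative STFT settings

def spectrogram_inpaint(y_in, inpaint_fn):
    """STFT -> inpaint the magnitude spectrogram -> Griffin-Lim back to a waveform."""
    spec = np.abs(librosa.stft(y_in, n_fft=N_FFT, hop_length=HOP))
    spec_filled = inpaint_fn(spec)  # e.g. an image-inpainting model treating spec as an image
    # Griffin-Lim iteratively estimates a phase consistent with the filled magnitude.
    return librosa.griffinlim(spec_filled, n_iter=60, hop_length=HOP)
```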

The spectrogram inpainting model architecture is based on a state-of-the-art image inpainting model [Yu et al.2018a] (see Fig. 2 (b2)). However, unlike natural images, where the x and y dimensions have similar scale and meaning, the time and frequency dimensions of a spectrogram differ significantly. Therefore, we explore different components to deal with convolutions on spectrograms (see Fig. 2).

Waveform Inpainting Model

Our waveform inpainting models directly take the masked raw waveform as input and generate the recovered waveform as output. However, unlike spectrograms, raw waveforms are of much higher dimension (16k samples per second). For a one-second audio clip at a sample rate of 16 kHz, over 61 samples are needed to capture a single cycle of a 261.63 Hz sinusoid, the musical note C4. As discussed in previous works [Aytar, Vondrick, and Torralba2016, Oord et al.2016, Donahue, McAuley, and Puckette2018], larger convolutional kernels and strided/dilated convolutions are often needed to deal with audio signals, as they increase the receptive field. In the proposed waveform inpainting model, we thus experiment with gated and dilated convolutions.

Gated convolutions.

For each convolutional layer in the waveform-based models (Fig. 2 (a2)), we adopt gated convolution [Yu et al.2018a] to softly attend to the masked areas:

O = φ(W_f * I) ⊙ σ(W_g * I),   (1)

where I is the input feature, W_g is the gating kernel, W_f is the feature kernel, σ is the sigmoid function that restricts the soft gating values to between 0 (invalid) and 1 (valid), φ is the activation function (LeakyReLU), ⊙ is element-wise multiplication, and * is the convolution operation. Note that a similar idea, gated activation [Van den Oord et al.2016], has also been found useful for audio generation tasks such as WaveNet [Oord et al.2016].
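A minimal PyTorch sketch of such a gated 1-D convolution is given below; the "same" padding scheme and the LeakyReLU slope are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Gated 1-D convolution: LeakyReLU(W_f * x) ⊙ sigmoid(W_g * x), cf. Eq. 1."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation  # keep the temporal length unchanged
        self.feature = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad, dilation=dilation)
        self.gate = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad, dilation=dilation)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):  # x: (batch, channels, samples)
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))
```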

Loss Functions

Masked L1 loss (L_1^mask).

The L1 loss focuses on low-level features and is widely used in both image and video inpainting models [Liu et al.2018, Wang et al.2019, Chang, Liu, and Hsu2019]. We apply the L1 loss on the masked area:

L_1^mask = || M ⊙ (Y_out − Y_gt) ||_1,   (2)

where M is the mask, Y_out is the model output, and Y_gt is the ground truth.
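A one-line PyTorch version of this masked loss might look as follows; averaging over the number of masked samples is our own normalization choice.

```python
import torch

def masked_l1_loss(y_out, y_gt, mask, eps=1e-8):
    """L1 loss restricted to the masked region, averaged over the masked samples."""
    return (mask * (y_out - y_gt)).abs().sum() / (mask.sum() + eps)
```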
Perceptual loss on waveforms.

The L1 loss often leads to blurry results [Yu et al.2018a, Chang, Liu, and Hsu2019], so we adopt the perceptual loss [Gatys, Ecker, and Bethge2015], originally used for style transfer, to enhance the audio quality. It is also used for image inpainting [Liu et al.2018, Yu et al.2018a], video inpainting [Chang, Liu, and Hsu2019], and super-resolution [Johnson, Alahi, and Fei-Fei2016, Ledig et al.2017].

Similar to using a VGG [Simonyan and Zisserman2014] pre-trained on ImageNet [Russakovsky et al.2015] for the image perceptual loss, we use a pre-trained SoundNet and fine-tune it on our benchmark dataset with a classification task for the audio perceptual loss:

L_p = || Φ(Y_out) − Φ(Y_gt) ||_1,   (3)

where Φ(·) denotes the features extracted from the last layer before the fully-connected layer of the fine-tuned SoundNet. Note that we follow a similar fashion to how [Chen et al.2018] uses a pretrained SoundNet to compute an audio perceptual loss.

Perceptual loss on spectrograms.

Aside from waveforms, we propose to consider perceptual loss on spectrograms, as image classification is relatively easier to learn. We transform waveforms to spectrograms with STFT, which are then used to fine-tune a ResNet50 [He et al.2016] pretrained on ImageNet [Russakovsky et al.2015] for audio classification. The fine-tuned ResNet50 then serves as a feature extractor for perceptual loss on spectrograms, as shown in Equation 3.
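As an illustration, such a spectrogram perceptual loss could be sketched as follows in PyTorch; the use of torchvision's ResNet50, the channel-repetition trick for single-channel spectrograms, and the feature averaging are assumptions of this sketch, and the fine-tuned classifier is expected to be supplied by the caller.

```python
import torch.nn as nn
from torchvision.models import resnet50

class SpecPerceptualLoss(nn.Module):
    """Feature-space L1 distance on spectrograms with a frozen ResNet50 (cf. Eq. 3)."""

    def __init__(self, finetuned_resnet=None):
        super().__init__()
        backbone = finetuned_resnet or resnet50()  # pass the classifier fine-tuned on SC09/ESC-50
        # Keep everything up to global average pooling; drop the fully-connected head.
        self.features = nn.Sequential(*list(backbone.children())[:-1]).eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, spec_out, spec_gt):  # (batch, 1, freq, time) magnitude spectrograms
        # Repeat the single channel to three so the ImageNet-style stem can be reused.
        f_out = self.features(spec_out.repeat(1, 3, 1, 1)).flatten(1)
        f_gt = self.features(spec_gt.repeat(1, 3, 1, 1)).flatten(1)
        return (f_out - f_gt).abs().mean()
```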

Experimental Results

Datasets

SC09.

The SC09 dataset is a subset of the Speech Commands Dataset [Warden2018] that contains single spoken words from zero to nine by different speakers in uncontrolled environments. Since it was first proposed by [Donahue, McAuley, and Puckette2018], it has been used in much audio generation research [Donahue, McAuley, and Puckette2018, Marafioti et al.2019] and is often regarded as the most common baseline in the area, much as the MNIST dataset [LeCun, Cortes, and Burges1998] is in handwritten digit recognition, although examples in SC09 are more complicated than those in MNIST.

ESC-50.

The ESC-50 dataset [Piczak] is a labeled dataset for environmental sound classification, consisting of 2000 five-second environmental audio recordings from 50 semantic classes (40 examples per class) in 5 categories: animals, natural soundscapes & water sounds, human non-speech sounds, interior/domestic sounds, and exterior/urban noises. Compared to SC09, examples in ESC-50 are more repetitive and thus easier for patch- and example-based methods, but harder for learning-based methods since it has more classes and less data per class.

Benchmark Procedure

We set up the long audio inpainting benchmark to compare baselines, including WaveGlow [Prenger, Valle, and Catanzaro2019], SimilarityGraph [Perraudin et al.2018], DeepPrior [Ulyanov, Vedaldi, and Lempitsky2018], and GMCNN [Wang et al.2018]. For SC09, we train all models on the whole training set with random 0.2-second masks and without any data augmentation, and evaluate on the testing set with a fixed mask from 0.4 to 0.6 second. For the ESC-50 dataset, we train models on the first 1600 sound clips (first to fourth fold) with random 0.4-second masks. Sound clips are repeated to 10 seconds and then randomly cropped to 5 seconds during training. The first 200 sound clips of the fifth fold are used for validation, while testing is done on the other 200 sound clips with a fixed mask from 3.0 to 3.4 seconds. For models that require longer inputs, we apply zero padding. Note that after inpainting, we paste the unmasked segments from the input into the output.

Baseline Implementation Details

WaveGlow

is a flow-based vocoder that transforms a mel-spectrogram to its corresponding waveform. It combines the essence of WaveNet and Glow and directly learns the data distribution. We modify it to take a masked mel-spectrogram instead of a complete one as input, using the code provided by NVIDIA (https://github.com/NVIDIA/waveglow). We train each model for 100,000 epochs with batch size 2 and 4 for SC09 and ESC-50, respectively.

Deep Image Prior

performs well on several image restoration tasks, including inpainting, by simply using the structure of a neural network and the corrupted image, without any prior training. We use the inpainting script on GitHub (https://github.com/DmitryUlyanov/deep-image-prior) to inpaint the masked spectrogram. We change the target iteration count from 6001 to 4001 to reduce processing time while maintaining the quality of the audio.

GMCNN

is one of the state-of-the-art image inpainting frameworks; it features a multi-column neural network that can model different image components and extract multi-level features to aid inpainting.

We use the framework provided on GitHub (https://github.com/shepnerd/inpainting_gmcnn) and modify the model to take a masked spectrogram as input instead of a 256 × 256 RGB image. Since the spectrogram has a single channel, we first modify the pretrained model by changing the first layer of the generator to a conv layer with one input channel, the last decoding layer to a conv layer with one output channel, and the first layer of both the global and local discriminators to conv layers with one input channel. These conv layers are all randomly initialized from a normal distribution. We then finetune the model for 40 epochs using the default settings.

SimilarityGraph

is an example-based framework that targets particularly long gaps in music. It detects spectro-temporal similarities between the unmasked data and the masked area, solving cases where adjacent segments fail to provide a solution.

We use the demo website (https://epfl-lts2.github.io/rrp-html/audio_inpainting/) to perform inpainting. Since their framework requires the mask to be at least 3 seconds away from the start and the end of the audio, we duplicate parts of the audio before uploading it to the website. For SC09, we duplicate both the front and rear 0.4 second 8 times, generating an audio that is 0.4 * 16 + 0.2 = 6.6 seconds long (hence, the mask is from 3.2 to 3.4 seconds). For ESC-50, we duplicate only the rear 1.6 seconds twice, generating an audio that is 3 + 0.4 + 1.6 * 2 = 6.6 seconds long (hence, the mask is from 3.0 to 3.4 seconds). After inpainting, we extract the segment that corresponds to the original audio. Note that since the algorithm replaces the masked part with an audio segment from the same signal, the position of the original audio might shift slightly.

Evaluation Metrics

To evaluate different methods numerically, we calculate the masked L1 error (Eq. 2) and the perceptual distance (Eq. 3) between the outputs and ground truths on waveforms and spectrograms, as explained in Sec. Loss Functions. For a fair comparison of the perceptual distance, we finetune another SoundNet [Aytar, Vondrick, and Torralba2016] and a VGG16 [Simonyan and Zisserman2014] as metric backbones. Please see Table 1 for the detailed settings of the different backbones, SoundNet and ResNet50, (pre-)trained for the perceptual losses on waveforms and spectrograms, respectively. In addition, we also report the structural similarity (SSIM) index [Wang et al.2004] on spectrograms. The inference speed is reported in terms of how many sound clips can be processed per second on an Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz with a single V100 GPU.
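For reference, the spectrogram SSIM can be computed, for example, with scikit-image; the data-range choice below is our own assumption for magnitude spectrograms.

```python
from skimage.metrics import structural_similarity

def spec_ssim(spec_out, spec_gt):
    """SSIM between two magnitude spectrograms given as 2-D numpy arrays."""
    data_range = max(float(spec_gt.max() - spec_gt.min()), 1e-8)
    return structural_similarity(spec_out, spec_gt, data_range=data_range)
```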

Dataset   Type    Model      Pretr.   Param.    Acc.
SC09      Wave.   SoundNet            14.3M     93.4%
SC09      Wave.   SoundNet            14.3M     91.0%
SC09      Spec.   VGG16               134.3M    96.0%
SC09      Spec.   ResNet50            23.5M     96.3%
SC09      Spec.   ResNet50            23.5M     94.7%
ESC-50    Wave.   SoundNet            14.7M     66.3%
ESC-50    Wave.   SoundNet            14.7M     61.0%
ESC-50    Spec.   VGG16               134.4M    83.5%
ESC-50    Spec.   ResNet50            23.6M     82.0%
ESC-50    Spec.   ResNet50            23.6M     77.0%
Table 1: Sound classification accuracy for perceptual losses/metrics on SC09 and ESC-50 testing set.
Figure 3: Spectrograms of audio inpainting results. SimilarityGraph performs well, as sounds in ESC-50 tend to embed repetitive structures for example-based methods to exploit. Also, we show that despite its long processing time, DeepPrior excels at spectrogram inpainting in addition to image inpainting.

Quantitative Results

Sound classification.

For the perceptual loss evaluation, we finetune SoundNet or train ResNet from scratch on audio classification with a cross-entropy loss, for waveforms and spectrograms respectively. The classification accuracy of each model is reported in Table 1. We observe that for ESC-50, spectrogram-based classification models outperform waveform-based ones, while for SC09 this does not hold.

This might be caused by the input dimension: raw waveforms in ESC-50 have more samples (22050 × 5 = 110250 samples) than those in SC09, making it harder for models based on 1-D convolutions with limited receptive field size to extract high-level features. In contrast, after the STFT, spectrograms are only 1024 × 400 and 1024 × 80 for ESC-50 and SC09 respectively, which are reasonable sizes for image classification models, and the size difference between the two datasets is smaller.

Another interesting point is that the weights pre-trained on ImageNet boost spectrogram classification by about 5%, even though the applications and domains are quite different (3-channel natural images vs. 1-channel spectrograms). This implies that pre-training on other datasets, such as Google AudioSet [Gemmeke et al.2017], could further improve the perceptual losses and metrics.

                        |                    SC09                       |                   ESC-50                      |
Method                    ML1       SSIM      Wave.P.Dist.  Spec.P.Dist.  ML1       SSIM      Wave.P.Dist.  Spec.P.Dist.  Infer.Speed*
Masked Input              0.011125  0.675040  0.006231      0.079655      0.067510  0.648608  0.007740      0.542866      -
Griffin-Lim GT            0.021293  0.808092  0.004214      0.021372      0.004539  0.978142  0.000784      0.007324      -
WaveRNN                   -         -         -             -             -         -         -             -             -
WaveGlow                  0.013689  0.730689  0.004494      0.077139      0.003048  0.929394  0.000821      0.035776      2.058
Sim.Graph                 -         -         -             -             0.004478  0.697933  0.003229      0.115829      0.039
GMCNN                     0.010769  0.695439  0.004866      0.073790      0.002738  0.935945  0.000728      0.031737      63.09
DeepPrior                 0.010940  0.719634  0.004607      0.067535      0.004175  0.951980  0.000499      0.017755      0.002
Ours (Spec., L1)          0.012073  0.422832  0.008016      0.080591      0.003148  0.727943  0.002362      0.154318      106.38
Ours (Spec., L1+SpecP)    0.017605  0.384210  0.006532      0.066645      0.003159  0.721920  0.002334      0.152135      106.38
Ours (Wave., L1)          0.010860  0.696274  0.005817      0.071622      0.002696  0.923965  0.000888      0.035509      92.93
Ours (Wave., L1+WaveP)    0.013796  0.775181  0.002909      0.051784      0.002931  0.923112  0.000859      0.035125      92.93
Table 2: Long audio inpainting benchmark results. WaveRNN fails to converge; the SimilarityGraph algorithm fails to find a solution for most cases in SC09. *The inference speed is the number of SC09 samples an algorithm can process in a minute.
Long audio inpainting benchmark results.

The long audio inpainting benchmark results are presented in Table 2. We observe that our models perform reasonably well on both SC09 and ESC-50 for all metrics. Still, we find that none of the evaluation metrics fully reflects human perception. For example, the STFT + Griffin-Lim process seriously damages the SSIM score even when the input is simply the ground truth (see the Griffin-Lim GT row); the perceptual distances are not affected by this process, but they may not fully reflect the amplitude (see Fig. 3: GMCNN has a low perceptual distance). On the other hand, although results from SimilarityGraph sound quite natural to humans (since the mask is filled with another part of the same sound clip), its performance is not as good on any metric because the filled-in contents differ from the ground truth. Surprisingly, the image inpainting models GMCNN and DeepPrior outperform the other baselines in all metrics (note that DeepPrior does not require training), whereas vocoders like WaveGlow and WaveRNN are not as good. This indicates that general image inpainting models could very likely be adapted to handle spectrograms as well, and that our spectrogram-based model still has much room for improvement, e.g., in kernel size and training loss. Also, though the perceptual loss we apply does help a little, it generally does not lead to a large improvement.

Qualitative Results

We compare output spectrograms from different baselines qualitatively in Fig. 3. The spectrograms show the sound of water filling a container over five seconds, with a 0.4-second mask at the three-second point. The Griffin-Lim GT has no mask and thus depicts how the spectrogram would look after undergoing the Griffin-Lim algorithm.

Among the baselines, we find that SimilarityGraph and DeepPrior perform well on inpainting the water sound. The environmental sounds in ESC-50 contain many repeating structures; for this reason, the result of SimilarityGraph intuitively sounds good with its copy-and-paste solution. Note that SimilarityGraph fails on most SC09 cases, as there are no repetitive structures that could be pasted in clips of spoken digits. With the results from DeepPrior, which is good at extrapolating local correlation, we show that sounds, like images, have local structure as well.

We find that DeepPrior does surprisingly well on audio inpainting, extrapolating implicit structures embedded in spectrograms, while WaveGlow inpaints sheer noise and GMCNN fails and inpaints sheer silence.

Our proposed method inpaints meaningful elements instead of sheer noise or pure silence in the masked part, as shown in Fig. 3. Results from baselines like DeepPrior and SimilarityGraph are clearer and more natural than ours. Nevertheless, DeepPrior needs much more inference time than the other methods, and SimilarityGraph is highly constrained to specific tasks due to its copy-and-paste solution.

Ablation Study

In this section, we experiment with different parameters, including the mask ratio and receptive field, to explore how these factors affect our model. Note that we perform all the following experiments on ESC-50.

We perform two sets of experiments. In the first, we fix the length of the mask and alter the receptive field by configuring the network architecture. In the second, we fix the receptive field and see how different mask sizes impact the performance.

Masked     Masked    Receptive    L1        SpecP     Suc.
Time (s)   Field     Field        Loss      Error
0.1        40        21           0.0272    0.131
0.1        40        29           0.0236    0.114
0.1        40        45           0.0217    0.101
0.1        40        61           0.0216    0.105
0.1        40        77           0.0209    0.108
0.1        40        93           0.0206    0.098
0.1        40        109          0.0214    0.117
0.1        40        125          0.0218    0.101
0.15       60        77           0.0338    0.1571
0.16       64        77           0.0380    0.1754
0.17       68        77           0.0446    0.2102
0.18       72        77           0.0447    0.2003
0.19       76        77           0.0480    0.2305
0.2        80        77           0.0552    0.2390
0.25       100       77           0.0676    0.2855
Table 3: L1 loss and perceptual error for different masked times (sec) and receptive fields on ESC-50. The masked field and receptive field are measured along the time axis. Errors are calculated on the validation set. Success (Suc.) indicates whether the model successfully inpainted the whole masked part or failed on a certain region; failure means the magnitude of part of the inpainted region stays unchanged for a period, even though the model inpainted most of the masked field.

We evaluate models that are trained for 50 epochs with the L1 loss, and report the L1 loss and spectrogram perceptual error (see Table 3).

Receptive field.

In our proposed structure, increasing the depth of the network enlarges the receptive field. According to our experimental results, the receptive field has to be larger than a certain threshold in order to inpaint the whole mask. Nevertheless, after reaching a certain size, enlarging the receptive field brings little benefit or even hurts training. This indicates that, beyond that threshold, our model is already complex enough to restore the mask.
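The receptive field along the time axis of a stack of 1-D convolutions can be computed with the standard recursion shown below; the configuration in the example is purely illustrative and not the paper's exact architecture.

```python
def receptive_field(kernel_sizes, strides=None, dilations=None):
    """Receptive field (in input positions) of stacked convolutions.

    Recursion: rf += (k - 1) * dilation * jump, where jump is the product
    of the strides of all preceding layers.
    """
    n = len(kernel_sizes)
    strides = strides or [1] * n
    dilations = dilations or [1] * n
    rf, jump = 1, 1
    for k, s, d in zip(kernel_sizes, strides, dilations):
        rf += (k - 1) * d * jump
        jump *= s
    return rf

# Example: five layers with kernel size 5 and dilations 1, 2, 4, 8, 16
# cover 1 + 4 * (1 + 2 + 4 + 8 + 16) = 125 input positions.
print(receptive_field([5] * 5, dilations=[1, 2, 4, 8, 16]))  # -> 125
```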

Mask ratio.

We train several models, altering the mask length from 0.1 to 0.25 seconds while keeping the receptive field fixed. We find that, with a fixed receptive field, our model adapts to different mask lengths (from 0.1 to 0.16 seconds) as long as the mask size is smaller than or equal to the receptive field. This again confirms that the receptive field has to be at least comparable to the mask size for successful inpainting. Observing the failure cases, the inpainted sound at the mask position trails off at first, vanishes in the middle, and reappears at the end. This indicates that if the masked field is too large, the receptive field cannot gather enough information to rebuild the whole masked part.

Discussion and Future Work

Receptive field and model architecture.

To the best of our knowledge, we are the first to evaluate different baselines and model architectures for deep long audio inpainting. We also discuss the effect of different mask ratios and receptive fields in the ablation study. However, compared to image classification/inpainting, there is very little research on model architectures for deep audio tasks. Current architectures are very diverse, with different stride/dilation/kernel sizes, while ESC-50 is not as large and diverse as ImageNet for comparison. Further experiments could be done to find a common structure for the audio perceptual loss and for waveform/spectrogram-based audio inpainting, possibly through neural architecture search [Zoph and Le2016].

More general datasets or datasets with other clues.

For the proposed benchmark, we compare methods on SC09 and ESC-50, corresponding to complicated short human voices and repetitive natural sounds. Nevertheless, in real-world scenarios there are many more kinds of sounds with longer durations and more complex (or simpler) structures, such as speech and music. Our benchmark currently does not cover enough datasets for general audio editing. Also, in many cases other clues such as text, images, and videos are available at the same time, which could possibly assist long audio inpainting.

Spectrograms to waveforms.

In this work, we apply the Griffin-Lim algorithm to turn spectrograms back into waveforms, as in audio synthesis [Donahue, McAuley, and Puckette2018] and text-to-speech [Tachibana, Uenoyama, and Aihara2018]. The reconstructed waveforms are similar to the original ones but with a slight loss (see Griffin-Lim GT in Table 2). It is worth mentioning that, unlike those two tasks, most of the phase information is available in long audio inpainting and could be used for better phase estimation of the masked area when transforming spectrograms back to waveforms. A model could learn to recover the missing phases in the masked area with hints from the surrounding phases and thus achieve better waveform reconstruction.

GAN loss.

Recently, many image/video inpainting [Yu et al.2018a, Chang, Liu, and Hsu2019] and audio synthesis works [Donahue, McAuley, and Puckette2018] have adopted the generative adversarial network (GAN) [Goodfellow et al.2014] to enhance output realism. However, in our experiments, the GAN loss does not help our models much. How to incorporate GAN and other loss functions to make the output sound more realistic is a possible future direction for audio inpainting.

Conclusion

In this paper, we build the first benchmark for long audio inpainting, which could facilitate audio editing tasks. We propose deep spectrogram-based and waveform-based audio inpainting models and compare them with baselines from related research. Our models are learning-based and can recover long audio masks with superior quantitative and qualitative performance against baseline methods on both SC09 and ESC-50. We also explore the effect of different mask ratios and model architectures, and discuss possible future directions for long audio inpainting.

References

  • [Adler et al.2011] Adler, A.; Emiya, V.; Jafari, M. G.; Elad, M.; Gribonval, R.; and Plumbley, M. D. 2011. Audio inpainting. IEEE Transactions on Audio, Speech, and Language Processing 20(3):922–932.
  • [Aytar, Vondrick, and Torralba2016] Aytar, Y.; Vondrick, C.; and Torralba, A. 2016. Soundnet: Learning sound representations from unlabeled video. In Advances in neural information processing systems, 892–900.
  • [Bahat, Schechner, and Elad2015] Bahat, Y.; Schechner, Y. Y.; and Elad, M. 2015. Self-content-based audio inpainting. Signal Processing 111:61–72.
  • [Barnes et al.2009] Barnes, C.; Shechtman, E.; Finkelstein, A.; and Goldman, D. B. 2009. Patchmatch: A randomized correspondence algorithm for structural image editing. In ACM Transactions on Graphics (ToG), volume 28,  24. ACM.
  • [Bertalmio, Bertozzi, and Sapiro2001] Bertalmio, M.; Bertozzi, A. L.; and Sapiro, G. 2001. Navier-stokes, fluid dynamics, and image and video inpainting. In CVPR, volume 1, I–I. IEEE.
  • [Bertalmio et al.2000] Bertalmio, M.; Sapiro, G.; Caselles, V.; and Ballester, C. 2000. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, 417–424. ACM Press/Addison-Wesley Publishing Co.
  • [Chang et al.2019] Chang, Y.-L.; Liu, Z. Y.; Lee, K.-Y.; and Hsu, W. 2019. Learnable gated temporal shift module for deep video inpainting. In BMVC.
  • [Chang, Liu, and Hsu2019] Chang, Y.-L.; Liu, Z. Y.; and Hsu, W. 2019. Free-form video inpainting with 3d gated convolution and temporal patchgan. In ICCV.
  • [Chang, Yu Liu, and Hsu2019] Chang, Y.-L.; Yu Liu, Z.; and Hsu, W. 2019. Vornet: Spatio-temporally consistent video inpainting for object removal. In CVPR, 0–0.
  • [Chen et al.2018] Chen, K.; Zhang, C.; Fang, C.; Wang, Z.; Bui, T.; and Nevatia, R. 2018. Visually indicated sound generation by perceptually optimized classification. In ECCV, 0–0.
  • [Criminisi, Perez, and Toyama2003] Criminisi, A.; Perez, P.; and Toyama, K. 2003. Object removal by exemplar-based inpainting. In CVPR, volume 2, II–II. IEEE.
  • [Donahue, McAuley, and Puckette2018] Donahue, C.; McAuley, J.; and Puckette, M. 2018. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208.
  • [Gatys, Ecker, and Bethge2015] Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2015. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.
  • [Gemmeke et al.2017] Gemmeke, J. F.; Ellis, D. P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R. C.; Plakal, M.; and Ritter, M. 2017. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776–780. IEEE.
  • [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
  • [Griffin and Lim1984] Griffin, D., and Lim, J. 1984. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2):236–243.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
  • [Huang et al.2016] Huang, J.-B.; Kang, S. B.; Ahuja, N.; and Kopf, J. 2016. Temporally coherent completion of dynamic video. ACM Transactions on Graphics (TOG) 35(6):196.
  • [Johnson, Alahi, and Fei-Fei2016] Johnson, J.; Alahi, A.; and Fei-Fei, L. 2016. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 694–711. Springer.
  • [Kalchbrenner et al.2018] Kalchbrenner, N.; Elsen, E.; Simonyan, K.; Noury, S.; Casagrande, N.; Lockhart, E.; Stimberg, F.; Oord, A. v. d.; Dieleman, S.; and Kavukcuoglu, K. 2018. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435.
  • [Kim et al.2019] Kim, D.; Woo, S.; Lee, J.-Y.; and Kweon, I. S. 2019. Deep video inpainting. In CVPR, 5792–5801.
  • [Lagrange, Marchand, and Rault2005] Lagrange, M.; Marchand, S.; and Rault, J.-B. 2005. Long interpolation of audio signals using linear prediction in sinusoidal modeling. Journal of the Audio Engineering Society 53(10):891–905.
  • [LeCun, Cortes, and Burges1998] LeCun, Y.; Cortes, C.; and Burges, C. 1998. MNIST dataset. URL http://yann.lecun.com/exdb/mnist.
  • [Ledig et al.2017] Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 4681–4690.
  • [Liu et al.2018] Liu, G.; Reda, F. A.; Shih, K. J.; Wang, T.-C.; Tao, A.; and Catanzaro, B. 2018. Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723.
  • [Marafioti et al.2018] Marafioti, A.; Perraudin, N.; Holighaus, N.; and Majdak, P. 2018. A context encoder for audio inpainting. arXiv preprint arXiv:1810.12138.
  • [Marafioti et al.2019] Marafioti, A.; Holighaus, N.; Perraudin, N.; and Majdak, P. 2019. Adversarial generation of time-frequency features with application in audio synthesis. arXiv preprint arXiv:1902.04072.
  • [Nazeri et al.2019] Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.; and Ebrahimi, M. 2019. Edgeconnect: Generative image inpainting with adversarial edge learning.
  • [Oord et al.2016] Oord, A. v. d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; and Kavukcuoglu, K. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
  • [Oudre2018] Oudre, L. 2018. Interpolation of missing samples in sound signals based on autoregressive modeling. Image Processing On Line 8:329–344.
  • [Perraudin et al.2018] Perraudin, N.; Holighaus, N.; Majdak, P.; and Balazs, P. 2018. Inpainting of long audio segments with similarity graphs. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26(6):1083–1094.
  • [Piczak] Piczak, K. J. ESC: Dataset for Environmental Sound Classification. In MM, 1015–1018. ACM Press.
  • [Prablanc et al.2016] Prablanc, P.; Ozerov, A.; Duong, N. Q.; and Pérez, P. 2016. Text-informed speech inpainting via voice conversion. In 2016 24th European Signal Processing Conference (EUSIPCO), 878–882. IEEE.
  • [Prenger, Valle, and Catanzaro2019] Prenger, R.; Valle, R.; and Catanzaro, B. 2019. Waveglow: A flow-based generative network for speech synthesis. In ICASSP, 3617–3621. IEEE.
  • [Russakovsky et al.2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252.
  • [Siedenburg, Dörfler, and Kowalski2013] Siedenburg, K.; Dörfler, M.; and Kowalski, M. 2013. Audio inpainting with social sparsity. SPARS (Signal Processing with Adaptive Sparse Structured Representations).
  • [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [Smaragdis, Raj, and Shashanka2009] Smaragdis, P.; Raj, B.; and Shashanka, M. 2009. Missing data imputation for spectral audio signals. In 2009 IEEE International Workshop on Machine Learning for Signal Processing, 1–6. IEEE.
  • [Tachibana, Uenoyama, and Aihara2018] Tachibana, H.; Uenoyama, K.; and Aihara, S. 2018. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In ICASSP, 4784–4788. IEEE.
  • [Ulyanov, Vedaldi, and Lempitsky2018] Ulyanov, D.; Vedaldi, A.; and Lempitsky, V. 2018. Deep image prior. In CVPR, 9446–9454.
  • [Van den Oord et al.2016] Van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Vinyals, O.; Graves, A.; et al. 2016. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, 4790–4798.
  • [Wang et al.2004] Wang, Z.; Bovik, A. C.; Sheikh, H. R.; Simoncelli, E. P.; et al. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4):600–612.
  • [Wang et al.2018] Wang, Y.; Tao, X.; Qi, X.; Shen, X.; and Jia, J. 2018. Image inpainting via generative multi-column convolutional neural networks. In Advances in Neural Information Processing Systems, 331–340.
  • [Wang et al.2019] Wang, C.; Huang, H.; Han, X.; and Wang, J. 2019. Video inpainting by jointly learning temporal structure and spatial details. In AAAI.
  • [Warden2018] Warden, P. 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
  • [Wolfe and Godsill2005] Wolfe, P. J., and Godsill, S. J. 2005. Interpolation of missing data values for audio signal restoration using a gabor regression model. In ICASSP, volume 5, v–517. IEEE.
  • [Yu et al.2018a] Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; and Huang, T. S. 2018a. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589.
  • [Yu et al.2018b] Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; and Huang, T. S. 2018b. Generative image inpainting with contextual attention. arXiv preprint.
  • [Zoph and Le2016] Zoph, B., and Le, Q. V. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.