Typically, a deep neural network distills robust priors from a large amount of labeled or unlabeled data (deng2009imagenet; jansen2018unsupervised). In audio research, neural networks such as VGGish (hershey2017cnn) and Wave-U-Net (stoller2018wave) have shown great success in audio classification, audio source separation, and many other challenging tasks. While large audio datasets have greatly improved supervised training, collecting and especially cleaning a large amount of audio data remains an open challenge (fonseca2017freesound). For example, in AudioSet (gemmeke2017audio), one of the popular audio datasets, we often find audio clips or videos that contain content beyond what the label specifies. (Footnote 1: As an example, under “bark” from AudioSet, we find https://youtu.be/2mJbGx5D-zA?t=150, which contains human speech, wind noise, and various audio effects in addition to barking.)
In this paper, we combine the power of deep neural networks with temporal priors in audio, without any external training data. Similar ideas have been explored in deep image prior (DIP) (ulyanov2018deep). Double-DIP from gandelsman2018double further showed that it is possible to achieve robust unsupervised image decomposition from a single-image input, without pre-training on any data. The capability of Deep Audio Prior (DAP) to train on a single audio file has several advantages. First, with proper selection of the audio priors, we show that DAP generalizes well to a wide variety of unseen types of data. Second, our training process is fully unsupervised and therefore makes it possible to pre-process large volumes of data in the wild. Last but not least, we show several novel applications that are only possible because of the unique features of DAP, including universal source separation, interactive editing, audio texture synthesis, and audio co-separation.
The domain gap between audio and visual images precludes direct adoption of image priors. Many assumptions or priors that hold for images no longer hold for audio. By nature, audio signals exhibit strong temporal coherence, e.g., one’s voice changes smoothly (see inset). Since images tend to have more spatial patterns, most existing deep image priors have focused on how to encapsulate spatial redundancy. Another challenge specific to audio is activation discontinuity. Unlike in videos, where an object moves continuously in the scene, a sound source can start sounding or turn completely silent at any given time (see inset).
Our proposed deep audio prior framework has the following main contributions:
A temporally coherent source generator that can reproduce a wide spectrum of natural sound sources including human speech, musical instruments, animal sounds, etc.
A novel mask activation scheme to enable and disable sources with frequency binning and without temporal dependence.
We demonstrate the effectiveness of DAP in several challenging applications, including universal audio source separation, interactive mask-based editing, audio texture synthesis, and audio co-separation.
We also introduce Universal-150, a benchmark dataset on blind source separation for universal sound sources. We show that DAP outperforms other blind source separation (BSS) methods on multiple numerical metrics.
2 Deep Audio Prior Framework
Deep Audio Prior is an unsupervised blind source separation framework. More specifically, DAP does not train on any data other than the input audio mixture. Similar to image foreground and background segmentation, audio blind two-source separation can be expressed as:
$$A_{mixture} = M_1(z_{m_1}) \odot S_1(z_{s_1}) + M_2(z_{m_2}) \odot S_2(z_{s_2}), \quad (1)$$
where $S_1$ and $S_2$ are two audio generator networks, $M_1$ and $M_2$ are two mask networks, and the inputs $z_{s_1}$, $z_{s_2}$, $z_{m_1}$, and $z_{m_2}$ are sampled from random distributions. Our method works in Time-Frequency (T-F) spectrogram space (cohen1995time). All the input and output variables share the same dimensions. Waveforms are transformed into spectrograms by the Short-Time Fourier Transform (STFT) with a frame size of 1022 and a hop size of 172, and the audio sampling rate is 11000 Hz. Figure 1 shows an overview of our framework.
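As a quick sanity check on the STFT settings above, the following minimal sketch computes the spectrogram size they imply. It assumes a one-sided STFT with no padding at the signal edges; the padding convention is not stated in the text, and the function name `stft_shape` is purely illustrative.

```python
def stft_shape(num_samples, frame_size=1022, hop_size=172):
    """Return (frequency_bins, time_frames) for a one-sided STFT.

    Assumption: no edge padding, so only complete frames are counted.
    """
    freq_bins = frame_size // 2 + 1  # one-sided spectrum of a real signal
    if num_samples < frame_size:
        return freq_bins, 0
    time_frames = 1 + (num_samples - frame_size) // hop_size
    return freq_bins, time_frames


SAMPLE_RATE = 11000  # Hz, as stated in the text

# A 3-second clip at 11 kHz.
bins, frames = stft_shape(3 * SAMPLE_RATE)
print(bins, frames)
print("frames per second:", SAMPLE_RATE / 172)
```

Under these assumptions, a frame size of 1022 gives 512 frequency bins, which is a convenient size for convolutional networks that halve the resolution several times.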
The intuition for our single-audio source separation is inspired by single-image decomposition: it is much easier for two generator networks to each learn one of two distinct sound sources than for a single generator to learn the mixture. gandelsman2018double analyzed this in terms of a complex mixture versus simple individual components. In audio, as shown in §3.1, networks tend to quickly learn patterns from two distinct sources.
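Let $A_1 = M_1 \odot S_1$ and $A_2 = M_2 \odot S_2$ denote the two separated sounds, i.e., the masked outputs of the two generators; then we have $A_{mixture} = A_1 + A_2$. Clearly, when combining the separated sounds $A_1$ and $A_2$, we should obtain the original sound mixture $A_{mixture}$. The data fidelity term is expressed as a reconstruction loss, which pushes the combination of the two separated sounds to be close to the original sound mixture.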
2.1 Temporally Coherent and Dynamic Source Generator
Temporally Coherent Source Generation
Sound signals from the same source tend to be similar across time (rosen1992temporal). To explicitly model this temporal property, we use multiple audio frames with temporally consistent noise as inputs for each sound source. We split the spectrogram into frames along the time axis and, accordingly, use noise input pairs to predict the corresponding sounds frame by frame. For the rest of this paper, we omit the subscript when there is no confusion.
To explicitly enforce temporal coherence, we impose strong correlations on the input noises:
$$z^{(k)} = z^{(k-1)} + \Delta z^{(k)}, \quad (3)$$
where $z^{(0)}$ is the initialization of the noise sequence and $\Delta z^{(k)}$ is a small perturbation. Since we share the generator networks across frames when predicting individual sounds and the noise inputs are temporally consistent, the networks are driven to predict temporally consistent sounds. A similar idea was adopted in (gandelsman2018double) to preserve video coherence, but they use the noise to predict masks. We also employ a temporal continuity loss to further enforce coherence by pushing the absolute values of the gradients along time to be small (chambolle2004algorithm).
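The two ingredients described above, correlated noise inputs and a penalty on temporal gradients, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the uniform initialization range, the Gaussian step size, and the use of a mean absolute difference are all hypothetical choices.

```python
import random


def coherent_noise(num_frames, dim, step=0.01, seed=0):
    """Build per-frame noise vectors where each frame is the previous one
    plus a small perturbation (a random walk).

    Assumptions: uniform initialization in [0, 0.1] and Gaussian steps.
    """
    rng = random.Random(seed)
    z = [rng.uniform(0.0, 0.1) for _ in range(dim)]
    frames = [z]
    for _ in range(num_frames - 1):
        z = [v + rng.gauss(0.0, step) for v in z]
        frames.append(z)
    return frames


def temporal_continuity_loss(spec):
    """Mean absolute difference between adjacent time frames.

    `spec` is indexed as spec[time][frequency].
    """
    total, count = 0.0, 0
    for prev, cur in zip(spec, spec[1:]):
        for a, b in zip(prev, cur):
            total += abs(b - a)
            count += 1
    return total / count if count else 0.0


noise = coherent_noise(num_frames=8, dim=4)
print(temporal_continuity_loss(noise))  # small, since consecutive frames are close
```

A spectrogram that is constant over time has zero loss, while abrupt frame-to-frame jumps are penalized in proportion to their size.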
Dynamic Source Generation
Audio signals can also have dynamic patterns with large variations. Since the variation in the temporally consistent noise is very small by construction, this noise can only handle small variations and cannot capture temporally dynamic sound patterns. To preserve the temporally consistent patterns in predictions and also hallucinate dynamic patterns, in the spirit of curriculum learning (bengio2009curriculum), we gradually add dynamic noise into the inputs as training progresses:
where the iteration index counts optimization steps and the dynamic term is random Gaussian noise. Unlike the temporally consistent noise in (3), which has a constant initialization, the dynamic noise is sampled independently for each frame. To balance the temporally consistent and dynamic noise throughout training, we introduce a coefficient:
where the coefficient is governed by two iteration thresholds, the first smaller than the second. Before the first threshold, only temporally shared constant noise is used; between the two thresholds, the dynamically sampled noise is gradually added; from the second threshold on, only dynamic noise is used. This noise input design first lets the model predict stable, temporally consistent sound patterns and then pushes the model to capture large variations in the spectrograms to reduce the reconstruction loss. Note that the coefficient is not continuous with respect to the iteration number (see red curve in inset). In practice, we found that this abrupt change improves the ability to capture dynamic audio changes. See more discussion in §4.
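To make the schedule concrete, here is a minimal sketch of a piecewise coefficient with an abrupt jump. The thresholds 2000 and 4000 are the values reported in the appendix; the size and position of the jump (`jump=0.5` at the first threshold), the linear ramp, and the convex mixing rule in `mix_noise` are assumptions made for illustration only.

```python
def dynamic_noise_weight(iteration, t1=2000, t2=4000, jump=0.5):
    """Weight of the dynamic noise at a given optimization iteration.

    Assumptions: the weight jumps from 0 to `jump` at t1 (the discontinuity),
    then ramps linearly to 1 at t2.
    """
    if iteration < t1:
        return 0.0
    if iteration >= t2:
        return 1.0
    ramp = (iteration - t1) / (t2 - t1)
    return jump + (1.0 - jump) * ramp


def mix_noise(z_coherent, z_dynamic, alpha):
    """Blend coherent and dynamic noise (assumed convex combination)."""
    return [(1.0 - alpha) * c + alpha * d for c, d in zip(z_coherent, z_dynamic)]


for it in (0, 1999, 2000, 3000, 4000):
    print(it, dynamic_noise_weight(it))
```

Setting `jump=0.0` in this sketch recovers a continuous schedule, which corresponds to the kind of variant the ablation without abrupt change would need.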
Frequency Domain Exclusion
We assume that the two separated sounds are dissimilar. To enforce this constraint, we use the exclusion loss from zhang2018single, which is formulated as the product of normalized gradient fields of the two sound predictions at multiple spatial resolutions:
where and are regularizers, the spatial patch index, the Frobenius norm, denotes element-wise multiplication. We set , , and .
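The structure of such an exclusion term can be sketched as follows: at each scale, take gradients of both predictions, squash them with tanh after balancing their magnitudes, multiply element-wise, and accumulate the Frobenius norm. This is a hedged sketch in the spirit of zhang2018single; the balancing factors (ratios of gradient norms), the number of scales, and the stride-2 downsampling are assumptions, not values taken from the paper.

```python
import math


def gradient(spec, axis):
    """Forward differences of spec[freq][time] along axis 0 (freq) or 1 (time)."""
    if axis == 0:
        return [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(spec, spec[1:])]
    return [[b - a for a, b in zip(row, row[1:])] for row in spec]


def frobenius(mat):
    return math.sqrt(sum(v * v for row in mat for v in row))


def downsample(spec):
    """Keep every second row and column (assumed downsampling scheme)."""
    return [row[::2] for row in spec[::2]]


def exclusion_loss(s1, s2, num_scales=3, eps=1e-8):
    """Sum over scales and axes of || tanh(l1*|grad s1|) * tanh(l2*|grad s2|) ||_F."""
    loss = 0.0
    for _ in range(num_scales):
        for axis in (0, 1):
            g1, g2 = gradient(s1, axis), gradient(s2, axis)
            n1, n2 = frobenius(g1), frobenius(g2)
            # Assumed balancing factors so neither gradient field dominates.
            lam1 = math.sqrt((n2 + eps) / (n1 + eps))
            lam2 = math.sqrt((n1 + eps) / (n2 + eps))
            prod = [[math.tanh(lam1 * abs(a)) * math.tanh(lam2 * abs(b))
                     for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]
            loss += frobenius(prod)
        s1, s2 = downsample(s1), downsample(s2)
    return loss


edge_a = [[0.0, 1.0, 1.0, 1.0] for _ in range(4)]  # edge after the first column
edge_b = [[0.0, 0.0, 0.0, 1.0] for _ in range(4)]  # edge before the last column
print(exclusion_loss(edge_a, edge_b), exclusion_loss(edge_a, edge_a))
```

The loss is zero when the two predictions never change at the same location and grows when their edges coincide, which is what discourages the two generators from explaining the same spectrogram structure.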
2.2 1D Mask Constraints for Audio
Sound sources do not make sound all the time, which breaks temporal consistency in the spectrogram domain. To address this issue, we introduce audio masks that activate sounding spectrum regions and deactivate silent regions.
If a source is sounding at a given time, the spectrum bin at that time should be activated. Based on this observation, mask values within the same temporal bin should be consistent and binary. Namely, if a dog barking sound is present, it should appear across the full frequency range at the same timestamp. Therefore, we force the mask within the same temporal bin to be consistent: for the $i$-th sound source, the mask at time $t$ takes the same value across all $H$ rows of the spectrogram, where $H$ and $W$ denote its height (frequency axis) and width (time axis). In our implementation, we use a max pooling operator to aggregate the mask content along the frequency axis and generate the mask value using a Sigmoid function. The resulting value represents the mask for the $i$-th source at time $t$. We define it to be a continuous variable for ease of optimization and enforce an extra binary loss term.
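The pooling step can be illustrated with a small sketch: a 2D map of mask logits is reduced to one value per time step by taking the maximum over frequency, passed through a sigmoid, and then broadcast back over all frequency rows. The function names and the `logits[freq][time]` layout are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def one_d_mask(logits):
    """Max-pool logits[freq][time] over frequency, then apply a sigmoid.

    Returns one activation in (0, 1) per time step.
    """
    num_steps = len(logits[0])
    pooled = [max(row[t] for row in logits) for t in range(num_steps)]
    return [sigmoid(v) for v in pooled]


def expand_mask(mask_1d, num_freq):
    """Broadcast the 1D mask so every frequency row shares the same values."""
    return [list(mask_1d) for _ in range(num_freq)]


logits = [[-5.0, 5.0],
          [-6.0, 0.0]]
mask = one_d_mask(logits)
full_mask = expand_mask(mask, num_freq=2)
print(mask)
```

Because every row of `full_mask` is identical, the mask can only switch a source on or off over time and cannot carry frequency content of its own.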
Furthermore, at any time, if there is sound in the mixture, at least one of the masks should be activated. To this end, we introduce a nonzero mask loss term:
where a small constant is used for numerical stability and a margin value bounds the penalty. The margin ensures that when the sum of mask activations is already larger than the margin, the loss does not continue to push the mask values. A weighting term suppresses this loss if there is no sound in the mixture at time $t$. Moreover, in our observations, masks are either fully activated or fully deactivated; very rarely would a sound source be at a partial activation level. Therefore, we introduce a differential loss term to encourage the mask networks to generate binary masks, where the small constant is again used to avoid numerical issues. This loss forces the mask values to be as far away as possible from 0.5, which means they are close to either 0 or 1.
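One possible reading of the two mask losses is sketched below. The exact formulas are not recoverable from the text, so everything here is an assumption: the hinge form with a margin, the energy threshold used to detect silence, and the reciprocal-distance form of the binary term are hypothetical stand-ins that only reproduce the described behavior.

```python
def nonzero_mask_loss(masks, mixture_energy, margin=1.0, eps=1e-8):
    """Penalize time steps where the mixture has sound but no mask is active.

    masks: list of 1D masks, one per source, each indexed by time.
    mixture_energy: per-time-step energy of the mixture spectrogram.
    Assumption: a hinge penalty that vanishes once the summed activation
    reaches `margin`, skipped where the mixture is silent.
    """
    total = 0.0
    for t, energy in enumerate(mixture_energy):
        if energy <= eps:  # silent frame: no constraint
            continue
        active = sum(m[t] for m in masks)
        total += max(0.0, margin - active)
    return total / len(mixture_energy)


def binary_mask_loss(mask, eps=1e-8):
    """Push mask values away from 0.5 (assumed reciprocal-distance form)."""
    return sum(1.0 / (abs(v - 0.5) + eps) for v in mask) / len(mask)


masks = [[1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]
energy = [1.0, 1.0, 0.0]
print(nonzero_mask_loss(masks, energy))
print(binary_mask_loss([0.0, 1.0]), binary_mask_loss([0.4, 0.6]))
```

In this sketch only the second time step is penalized: the first already has an active mask, and the third is silent in the mixture.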
2.3 Walkthrough on Synthetic Separation Examples
With all the pieces together, our total loss is defined as:
We optimize the networks end to end with all the loss functions. We empirically keep the weights for all loss terms the same, with a single exception that has its own factor. To validate the implementation of our algorithm, we generate two types of synthetic spectrograms.
(i) Single-frequency band fake sounds
We generate a synthetic mixture in which each of the two audio sources produces a flat tone at a single frequency. For this test, since we know that the spectrogram at different segments will always produce the same output (each sound source is just a flat bar), we keep the input noise identical across segments for sound source 1, and likewise for source 2. The top two rows of Figure 2 show that DAP achieves perfect prediction on both the source generators and the masks.
(ii) Curved input sounds
Our next test validates whether our DAP framework can handle temporally coherent spectrogram changes. Two shifted cosine curves are combined as input to our separation pipeline (see the bottom two rows of Figure 2). We set the input noise according to equation (3). Again, our DAP framework achieves the desired separation of sources and masks.
Next, we present several challenging applications of DAP. For the best experience, please visit our webpage (our anonymous submission page: https://iclr-dap.github.io/Deep-Audio-Prior/) to listen to our audio results. Unless specified otherwise, all the experiments for the same application are run with the same set of parameters.
3.1 Universal Blind Source Separation
Given a sound mixture, universal blind source separation aims to separate the individual sounds from the mixture without using any external data. Universal blind separation is challenging because the input audio can come from an arbitrary domain, not just the commonly studied speech or music domains.
Universal-150 Audio Benchmark
To the best of our knowledge, there is no publicly available universal audio source separation dataset. Therefore, we built such a dataset, which contains 150 audio mixtures, each a two-sound mixture. These mixture samples come from pairs of 30 unique sounds from YouTube and the ESC50 sound classification dataset (piczak2015esc), covering a large range of sound categories that appear in daily life, such as animal (e.g., dog, cat, and rooster), human (e.g., human speech, baby crying, and baby laughing), music (e.g., violin and guitar), natural sounds (e.g., rain, sea wave, and crackling fire), and domestic and urban sounds (e.g., clock, keyboard typing, and siren).
Comparison with Blind Separation Methods
We compare our method with several blind source separation (BSS) methods: non-negative matrix factorization (NMF) from spiertz2009source, robust principal component analysis (RPCA) from huang2012singing, and kernel additive modelling (KAM) from yela2018does. Figure 3 shows that our method outperforms the compared methods qualitatively. Please visit our webpage to listen to the audio results. For implementations, we use (nussl) for NMF and RPCA and (yela2018does) for KAM.
Moreover, to quantitatively evaluate the sound separation performance of the different BSS methods, we run separation on all 150 mixtures and compare the methods on three metrics: Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), and Audio Spectrum Distance (LSD) (vincent2006performance; morgadoNIPS18). SDR and SIR measure the distortion and interference in the separated audio. LSD measures the Euclidean distance between a predicted audio spectrogram magnitude and the corresponding ground truth spectrogram magnitude.
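For intuition about these metrics, the sketch below implements the spectral distance exactly as described (a Euclidean distance between magnitude spectrograms) together with a deliberately simplified distortion ratio. The simplified ratio is an assumption for illustration: the SDR and SIR of vincent2006performance decompose the estimate into target, interference, and artifact components, which this version does not do.

```python
import math


def spectral_distance(pred_mag, true_mag):
    """Euclidean distance between two magnitude spectrograms of equal shape."""
    return math.sqrt(sum((p - q) ** 2
                         for row_p, row_q in zip(pred_mag, true_mag)
                         for p, q in zip(row_p, row_q)))


def simple_sdr(reference, estimate, eps=1e-12):
    """Simplified signal-to-distortion ratio in dB.

    Assumption: all deviation from the reference counts as distortion.
    """
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (error + eps))


print(spectral_distance([[3.0, 0.0]], [[0.0, 4.0]]))
print(simple_sdr([1.0, 1.0], [1.0, 0.9]))
```

Higher is better for the ratio-based metrics, whereas lower is better for the spectral distance, which matters when the metrics are compared per sample.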
DAP outperforms these three BSS methods on all three numerical metrics, as shown in Table 1. Note that none of the four methods, including DAP, requires any training data. The only input to each algorithm is the single mixture audio file.
Comparison with Methods using Deep Networks
Existing deep audio separation networks are supervised and usually trained on music or speech data, so they cannot handle unseen sounds in our universal sound separation dataset. We therefore compare our method with two state-of-the-art deep audio networks, Deep Network Prior (DNP) from michelashvili2019audio and Speech Enhancement Generative Adversarial Network (SEGAN) from pascual2017segan, on noisy speech data. Figure 4 illustrates the speech denoising results. In our tests, the supervised SEGAN model works well on noises that are similar to those in its training set. However, for unseen novel noises such as the keyboard typing noise in Figure 4 (a), SEGAN did not remove the noise, whereas the proposed DAP can remove the background noise. A side effect of DAP aggressively removing noise is the excessive removal of speech signals, which might lead to lower quality in some cases. Note that we do not impose any speech-related denoising prior in the universal DAP model.
3.2 Interactive Mask-based Editing
Since the output of our method contains both a mask and a sound generator, we can add constraints on either the mask or the sound spectrogram. Traditionally, this type of interaction is performed only on spectrograms, since earlier tools do not have predicted masks available (bryan2014isse). There are several drawbacks to using spectrogram strokes: spectrograms are unintuitive for non-audio professionals, and even then, many strokes are needed for real-world audio.
In comparison, constraints on 1D masks are much easier for users to specify. Figure 5 shows that a simple 1D box given by the user can quickly improve the results. These 1D box constraints can be easily drawn by users who are not familiar with frequency spectrograms: they simply select the regions where they think a given source should have sound or should be silent. These selected regions become input activation masks or deactivation masks, respectively, whose values in the annotated regions are one. To encode these annotations, we introduce mask activation and deactivation losses to refine the results from our separation networks:
With the activation and deactivation losses, our DAP framework refines the source generators and masks. This refinement process typically takes tens of seconds. Figure 5 shows an interactive mask editing result: by adding a mask deactivation loss for the “dog” sound to the optimization, our network removes the “violin” patterns from it.
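A user annotation of this kind can be turned into a pair of penalties on the predicted 1D mask, as in the hypothetical sketch below. The loss forms are assumptions chosen only to match the described behavior: inside an activation box the mask is pulled toward 1, and inside a deactivation box it is pulled toward 0.

```python
def mask_edit_losses(mask, activate, deactivate):
    """Return (activation_loss, deactivation_loss) for one source.

    mask: predicted 1D mask values in [0, 1], indexed by time.
    activate / deactivate: user boxes as 0/1 indicators per time step.
    Assumption: average linear penalties inside the annotated regions.
    """
    n_act = sum(activate) or 1      # avoid division by zero for empty boxes
    n_deact = sum(deactivate) or 1
    act_loss = sum(a * (1.0 - m) for a, m in zip(activate, mask)) / n_act
    deact_loss = sum(d * m for d, m in zip(deactivate, mask)) / n_deact
    return act_loss, deact_loss


predicted = [0.9, 0.2, 0.8]
act_box = [1, 0, 0]    # user marks step 0 as "should sound"
deact_box = [0, 0, 1]  # user marks step 2 as "should be silent"
print(mask_edit_losses(predicted, act_box, deact_box))
```

Time steps outside both boxes contribute nothing, so the rest of the mask remains free to be determined by the other loss terms.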
3.3 Audio Texture Synthesis
Another interesting application is audio synthesis, or audio interpolation/extrapolation. The goal of audio synthesis is to lengthen a given audio clip. This has wide applications in generating audio textures of arbitrary length to match visual content. Here we show an example of how we can leverage the input noise latent space to achieve audio texture synthesis (Figure 6).
Given an audio texture, we apply the DAP framework and obtain the temporally coherent input noises for the whole sequence together with a trained generator. To lengthen the input texture, we can first use straightforward interpolation: we insert a new noise frame between every two consecutive noise frames, so that original and interpolated frames alternate. This way we can easily double (or more) the length of the input audio. If we want to explore more diversity in our learned generator, we can also extrapolate in the latent space. In other words, we can create brand-new latent vectors that are slightly outside our input training manifold. These interpolated and extrapolated latent vectors can then be used as input to the source generator. Figure 6 shows that we can prolong a 3-second audio texture to a 12-second one with a combination of interpolation and extrapolation in latent space. No direct copy and paste is used, so we naturally avoid the seam discontinuity problem.
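The latent-space manipulation described above can be sketched directly on a list of noise frames. The midpoint interpolation follows the text; the extrapolation rule (continuing the last step linearly) is an assumption, since the text only says the new vectors lie slightly outside the training manifold.

```python
def interpolate_frames(frames):
    """Insert the midpoint between every two consecutive noise frames."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append([(x + y) / 2.0 for x, y in zip(a, b)])
    out.append(frames[-1])
    return out


def extrapolate_frames(frames, extra):
    """Append `extra` frames by repeating the last step (assumed linear rule).

    Requires at least two input frames.
    """
    step = [b - a for a, b in zip(frames[-2], frames[-1])]
    out = list(frames)
    last = frames[-1]
    for _ in range(extra):
        last = [v + s for v, s in zip(last, step)]
        out.append(last)
    return out


latents = [[0.0], [2.0], [4.0]]
longer = extrapolate_frames(interpolate_frames(latents), extra=2)
print(longer)
```

Interpolating n frames yields 2n - 1 frames, so one pass roughly doubles the duration, and repeated passes or extrapolation extend it further once the frames are fed to the trained generator.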
3.4 Co-separation / Audio Watermark Removal
Audio watermarks are commonly used in the music industry for copyright-protected audio. Supervised deep networks require a lot of training data to learn both diverse clean audio patterns and watermark patterns in order to separate watermark sounds from clean sounds (muth2018improving). Our DAP model can easily generalize to co-separation for removing audio watermarks.
Given a set of sounds, each containing a mixture of the same unknown audio watermark and a different clean audio signal, the goal is to recover the clean audio. Using the proposed separation framework, we learn a generator network for each music signal and a generator for the watermark with shared network weights and input noise. As shown in Figure 7, from the input mixtures, we can extract the clean music as well as the embedded watermark.
4 Ablation Study and Discussion
Temporal Noise Input
To capture the temporal coherence prior within audio spectrograms, we decompose individual noise inputs into temporally consistent segments. To justify the temporal noise, we compare to a variant of our model without temporal noise (w/o TN), which has no temporal consistency and uses only random per-frame noise. As shown in Figure 8 (b), the model without temporal noise fails to preserve the temporally coherent “basketball” sound structures, while our DAP model captures the temporally consistent patterns in the “basketball” sound well.
We introduce dynamic noise, following curriculum learning, to capture large variations in individual sounds while preserving temporally consistent structures. In particular, we use abruptly changing dynamic noise during the noise input transition so that the networks learn dynamic patterns more quickly. To show the effects of dynamic noise and the abrupt noise transition, we compare to two baseline models: one without dynamic noise (w/o DN) and one without abruptly changing noise (w/o AC). The results are illustrated in Figure 8 (c) and (d). The w/o DN model fails to separate the two sounds and can only capture a few dynamic structures in the “violin” sound due to its weak dynamic modeling capacity. Although the w/o AC model can restore more dynamic “violin” patterns, it incorrectly adds “violin” sounds into the “basketball” sounds. In comparison, with the abrupt transition, our full DAP model gives both sound prediction networks strong dynamic modeling capacity right away.
1D Mask Design
Unlike the unconstrained masks commonly used for images and videos, we design a 1D mask. This design explicitly decomposes sound estimation into two sub-problems, temporal sound prediction and mask modulation, which eases mask learning and temporal consistency modeling. Such a disentanglement idea has been widely used in separating geometry and appearance estimation (NIPS2017_7175; lin2019photometric). To validate the effectiveness of the 1D mask, we compare it to a baseline model with unconstrained masks. As illustrated in Figure 9 (a), the model without the strong 1D mask constraint easily finds a shortcut to minimize the loss functions, and the unconstrained masks also capture sound content. With the 1D mask, our DAP model disentangles sound prediction and mask modulation and successfully separates the two sounds.
To demonstrate the effectiveness of the proposed nonzero mask loss, we show results of DAP optimized without this loss in Figure 9 (b). From the mask results, we find that mask values are close to zero for both sounds even in some sounding regions. With the loss term, DAP can reconstruct sounds for all sounding regions.
RPCA decomposes a spectrogram matrix into a low-rank matrix and a sparse matrix. To satisfy the principle of the decomposition, the low-rank matrix captures repeating structures in the audio spectrogram, and the remaining large variations in the spectrogram are preserved in the sparse matrix. KAM assumes that an audio source at a given time step can be estimated from its values at other nearby times through a source-specific proximity kernel, which addresses local redundancy in audio sources but fails to model dynamic patterns.
In essence, various formulations have been proposed to utilize temporal redundancy and dynamics. However, it is challenging for a traditional formulation to capture both temporal consistency and dynamic patterns due to its limited capacity. To model both, we take advantage of the large capacity of deep networks and use temporally consistent noise inputs, which help our network simultaneously restore temporally consistent sounds and capture large variations.
5 Related Work
As discussed in the introduction, our work is inspired by the recent advances in deep image prior (DIP) and double-DIP (ulyanov2018deep; gandelsman2018double). Recently, michelashvili2019audio tried to learn a deep network prior for audio by following the DIP process. Instead of learning the priors from audio signals, a predicted a priori SNR of the input audio is fed into traditional denoising methods, such as LSA (ephraim1985speech) or the Wiener filter, to perform audio denoising. Therefore, their method still relies on the accuracy of existing denoising algorithms. In our deep audio prior framework, we propose to capture the inherent audio priors using deep networks without relying on, or being bottlenecked by, previous methods.
Audio source separation is a classical problem in signal processing (haykin2005cocktail; naik2014blind; makino2018audio). To address the problem, many blind audio source separation methods have been proposed, such as NMF (virtanen2007monaural; smaragdis2003non), RPCA (huang2012singing), and KAM (liutkus2014kernel; yela2018does). For multichannel audio, independent component analysis (ICA) is also commonly used (hyvarinen2000independent; smaragdis1998blind; dinh2014nice); in our paper, we focus on separation from single-channel audio. To improve separation performance, supervised NMFs are explored in (mysore2011non; smaragdis2006convolutive). However, these methods usually have limited capacity to handle varied and complicated sound patterns. In our work, we leverage the large capacity of deep neural networks to encode both the temporal audio priors and the large variations that exist in real-world complex audio.
Recently, deep audio separation networks have been proposed (chandna2017monoaural; hershey2016deep; isik2016single; chen2017deep; smaragdis2017neural; le2015deep; wang2018supervised), all of which require a large amount of audio training data. Also, a deep model trained on a certain type of sound source cannot generalize well to audio from unseen categories, as we show in §3.1. These limitations prevent existing deep networks from addressing scalable universal audio separation. Unlike the existing approaches, our DAP framework is trained only on the single input mixture audio, does not require any additional training data, and hence is immune to the data mismatch between training and testing sets.
6 Conclusion and Future Work
We have introduced Deep Audio Prior, a new audio prior framework that requires zero training data. Given its universal and unsupervised nature, we are excited about the potential applications that DAP can enable. We have demonstrated impressive results on challenging tasks, even when comparing our model to models trained on a large amount of supervised data. All of our examples are listed on the anonymous webpage: https://iclr-dap.github.io/Deep-Audio-Prior/
Limitations and Future Work
Our framework naturally extends to multiple sound sources, e.g., co-segmentation with 4 sources in total. Yet for audio recorded in the wild, one main challenge is to decide how many effective sound sources are present (girin:hal-01943375). One direction that we would like to explore is to use the output/error metrics from DAP separation to iteratively decide the optimal number of effective sources. Visual information can also be used to guide this process (sodoyer2002separation; AytarNIPS2016_6146). It remains an open challenge on how to design a deep audiovisual prior to combine audio and visual information to distill a more robust representation.
In the interactive editing demo (§3.2), we show that progressive refinement based on user input can be easily done within seconds. However, the initial DAP separation usually takes on the order of minutes for a short audio segment. Accelerating the initial training stage is a promising direction. In supervised deep learning, distilling knowledge from multiple models has shown great success (Hinton44873). Given that DAP works for audio in the wild, we are interested in how to robustly distill useful information from the many generators learned on individual audio clips.
The work was partly supported by NSF IIS 1741472 and IIS 1813709. We gratefully acknowledge the gift donations from Adobe. This article solely reflects the opinions and conclusions of its authors and not those of NSF or Adobe. We would like to thank Nicholas Bryan for discussing blind source separation methods and Wilmot Li for brainstorming application scenarios.
Appendix A
A.1 Network Structure
We use a UNet architecture, as in ulyanov2018deep, for our audio and mask generator networks. In the networks, downsampling is achieved by setting the stride of the convolutional layer to 2, and upscaling is implemented by bilinear interpolation. In our implementation, we use a relatively small UNet structure, which consists only of downsampling modules and upsampling modules. The parameters of the used UNet are , , , [None, None, 1]. While tuning the parameters of the UNet, we made the same observation as ulyanov2018deep: a wide range of hyper-parameters for the UNet achieves similar performance. We empirically found that our models converge within 5000 iterations.
A.2 Additional Sound Separation Results
To further clarify the numerical results in Table 1, we complement the mean SDR, mean SIR, and mean LSD, which are sensitive to extremely large or small values (variance has the same issue), with a better-sample ratio for each of the three metrics. The ratio is the fraction of testing samples on which DAP is better than the compared method. For example, compared to NMF in terms of SDR, the ratio is 0.753, which shows that our method is better than NMF on 75.3% of the testing samples. The comparison results are shown in Table 2. We can see that DAP achieves better results on a majority of testing samples compared to NMF, RPCA, and KAM.
Table 2: Better-sample ratios on Universal-150 (columns: DAP vs. NMF, DAP vs. RPCA, DAP vs. KAM).
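The better-sample ratio described above is straightforward to compute from per-sample scores. The sketch below uses made-up numbers purely for illustration; the function name and the `higher_is_better` flag are assumptions, with the flag covering the fact that SDR and SIR are better when higher while LSD is better when lower.

```python
def better_sample_ratio(ours, theirs, higher_is_better=True):
    """Fraction of samples on which `ours` beats `theirs` (ties do not count)."""
    if higher_is_better:
        wins = sum(1 for a, b in zip(ours, theirs) if a > b)
    else:
        wins = sum(1 for a, b in zip(ours, theirs) if a < b)
    return wins / len(ours)


# Illustrative per-sample scores, not results from the paper.
sdr_ours = [5.0, 3.0, 8.0, 1.0]
sdr_baseline = [4.0, 4.0, 2.0, 0.0]
print(better_sample_ratio(sdr_ours, sdr_baseline))

lsd_ours = [1.0, 2.0]
lsd_baseline = [2.0, 1.0]
print(better_sample_ratio(lsd_ours, lsd_baseline, higher_is_better=False))
```

Unlike a mean, this ratio is unaffected by a few samples with extreme scores, which is why it complements the averaged metrics.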
To further validate DAP on sound source separation, we compare with other methods on a standard sound separation benchmark from (vincent2007oracle), which has 20 testing samples, each with 3 clean sounds. To evaluate the different sound separation methods, we take the first two sounds from each sample to compose 2-sound mixtures. As shown in Table 3, our method outperforms the three compared methods in terms of SDR, SIR, and LSD. Table 4 reports the better-sample ratio results, where DAP again achieves better results overall.
Table 4: Better-sample ratios on the benchmark from (vincent2007oracle) (columns: DAP vs. NMF, DAP vs. RPCA, DAP vs. KAM).
A.3 Dynamic Results
To model dynamic patterns with large variations in audio sources, we gradually add dynamic noise into the inputs as training progresses. In our implementation, the two iteration thresholds in Equation (6) are set to 2000 and 4000, respectively. Note that the maximum number of training iterations is 5000. When the iteration number is smaller than 2000, only temporally coherent noise inputs are used; when the iteration number is in [2000, 4000], we gradually add dynamic noise into the inputs; when the iteration number is larger than 4000, the noise inputs are fixed. To illustrate the training dynamics, we show separation results at different training iterations in Figure 11. We can see that our model first restores temporally coherent audio sources under the input noise constraint; when we introduce dynamic noise into the input (after 2000 iterations), the model quickly learns the dynamic patterns in the sound mixtures and captures sound variations well while preserving temporally consistent audio sources.