1 Introduction
Audio is frequently recorded in noisy environments, such as the outdoors or a stadium filled with screaming fans. This motivates the task of noise reduction, i.e., removing noise from a corrupted transmission and returning a clean recording. Noise reduction has a myriad of applications, such as enhancing the quality of a concert recording or making speech easier to parse for automated transcription. Accordingly, a number of approaches to this problem have been studied, including Wiener filters [14], spectral noise gates [4], and deep neural networks [1].

One approach that has seen a lot of use in recent years is dictionary learning via non-negative matrix factorization (NMF) [10, 16, 13, 11, 12]. In the NMF method, one aims to factorize a non-negative data matrix $X \in \mathbb{R}_{\geq 0}^{d \times n}$ as a product of a dictionary matrix $W \in \mathbb{R}_{\geq 0}^{d \times r}$ and a code matrix $H \in \mathbb{R}_{\geq 0}^{r \times n}$, with $X = WH$. We typically interpret the columns of the data matrix $X$ as the data points, and the columns of the dictionary matrix $W$ as the atoms of a dictionary, such that every data point can be represented as a non-negative linear combination of dictionary atoms. Furthermore, one often desires that each data point be represented using only a few dictionary atoms, which can be mathematically formalized as a sparsity requirement on the code matrix $H$.
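For illustration, a factorization of this kind can be computed with an off-the-shelf NMF solver. The sketch below (our own, not the paper's code) uses scikit-learn with arbitrary dimensions; note that scikit-learn treats rows, rather than columns, as data points, so we factorize the transpose.

```python
# Illustrative only: a small NMF factorization with scikit-learn.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((128, 400))          # non-negative data matrix, columns = data points

model = NMF(n_components=20, init="random", random_state=0, max_iter=500)
H_T = model.fit_transform(X.T)      # codes, in sklearn's transposed convention
W_T = model.components_             # dictionary, in sklearn's transposed convention

W, H = W_T.T, H_T.T                 # back to the paper's convention: X ≈ W @ H
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))  # relative Frobenius error
```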
A major advantage of NMF is its interpretability. For example, when applied to pictures of human faces, the atoms of the dictionary produced by NMF will resemble, e.g., a person's eyes or ears [5]. This is in contrast to other factorization approaches, such as principal component analysis, which yield hard-to-interpret "eigenfaces" [9]. This qualitative difference is explained by the non-negativity constraint. Since there are no negative terms, cancellation cannot occur, and thus dictionary atoms are forced to concentrate around salient features. In the context of audio denoising, the dictionary obtained by NMF factorization is expected to pick up structural features of the clean signal, so that the signal and the noise components of the noisy recording will be represented by different dictionary atoms.
In most cases, an exact factorization is not possible. Instead, one aims to obtain a close approximation of the data matrix $X$ by a product $WH$ by minimizing a loss function $\ell(X, W, H)$ subject to the constraint that $W$ and $H$ have non-negative entries. In this paper, we measure the approximation error via the Frobenius norm and use $\ell_1$ regularization to promote sparsity of the code matrix. With these choices, we are able to obtain our code and dictionary matrices via an explicit multiplicative update formula.

In this paper, we obtain the data matrix by taking the spectrogram of the input signal, i.e., the magnitude of its short-time Fourier transform (STFT). Such measurements are commonly used in audio processing tasks (see, e.g., [2]) and encode information about which frequencies are active at each point in time in a recording. Our method is motivated by the observation that, unlike noise, most natural sounds, such as voices and musical instruments, have only a few dominant frequencies active at each point in time. Thus, one may hope to obtain a dictionary in which "signal atoms" can be readily distinguished from "noise atoms". In particular, given prior information on what the signal and the noise are like, one can train two dictionaries $W_S$ and $W_N$. Then, if $Y$ and $X$ denote the spectrograms of the noisy signal and the clean signal, respectively, one can expect that $W_S$ can sparsely code the spectrogram $X$ of the clean signal but not $Y$.

In this paper, we use Online NMF (ONMF), an algorithm developed for streaming data or for situations where the data set is too large to store in local memory. In the latter case, ONMF alleviates the memory burden by allowing one to load only a portion of the data set at a time. As with traditional NMF, we learn a dictionary matrix $W$ such that the columns of the data matrix can be well-approximated as non-negative linear combinations of the columns of $W$. However, in this context, we view the columns of the data matrix as samples of some probability distribution, and we view the dictionary $W$ as learning the essential components of this distribution. As discussed in [6], there is a natural application of ONMF to non-negative time-series data where the columns are interpreted as the terms of a vector-valued stochastic process.
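To make the setting concrete, the following sketch shows one way to form the spectrogram data matrix from a recording. The file name, window length, and hop size are illustrative assumptions rather than values taken from the paper.

```python
# A minimal sketch of forming a spectrogram matrix from a recording.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

rate, y = wavfile.read("recording.wav")   # hypothetical input file
y = y.astype(np.float64)
if y.ndim > 1:                            # mix down to mono if needed
    y = y.mean(axis=1)

# Short-time Fourier transform; Zxx has shape (frequencies, time frames).
f, t, Zxx = stft(y, fs=rate, nperseg=1024, noverlap=512)

Y = np.abs(Zxx)          # the non-negative spectrogram used as the data matrix
phase = np.angle(Zxx)    # phases, kept aside for later reconstruction
```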
1.1 Contribution
This paper builds upon previous work applying NMF to spectrogram measurements for the purpose of audio separation (see, e.g., [3, 16]). Instead of the traditional NMF approach developed in these previous works, we use online non-negative matrix factorization. This is motivated by several considerations.
First, when using ONMF, one views the spectrogram as a time series of vectors, where each vector indicates the active frequencies at a specific time. By sampling batches from this time series, ONMF learns a dictionary that is better suited to represent phonemes and musical chords than the dictionary obtained using NMF, which does not exploit the time-frequency interpretation of the spectrogram. Therefore, one might hope that ONMF will learn different atoms from traditional NMF and achieve better reconstruction. Indeed, based on our numerical experiments in Section 3, the dictionary learned by ONMF is different from the one learned by traditional NMF, and our ONMF-based denoising algorithm exhibits superior performance.
Secondly, ONMF does not require one to store the entire spectrogram, but instead works with smaller matrices obtained by subsampling its columns. Therefore, ONMF is significantly more memory-efficient than traditional NMF.
Lastly, unlike traditional NMF, which requires the entire recording to be known in advance, ONMF has potential applications to the real-time denoising of streaming audio. As a simple motivating example, consider the streaming broadcast of a concert. A microphone near the band will observe a mixed signal of the band and the audience, while a microphone placed in the audience will observe a signal with very heavy noise. We can learn a dictionary for the clean signal from the band's studio recordings in advance. Using ONMF, one could then use the sound picked up by the audience microphone to learn a dictionary for the noise in real time and use this dictionary to denoise the recording picked up by the stage microphone.
The rest of this paper is organized as follows. In Section 2, we explain how to apply ONMF to the noise reduction problem. In Section 3, we present experimental results demonstrating the utility of the ONMF approach and its advantages over traditional NMF. Lastly, in Section 4, we provide a brief conclusion and discuss future research directions.
2 Method
We assume that the observed signal $y$ can be written as

$$y = x + n,$$

where $x$ is the clean signal and $n$ is noise. Let $Y$ and $X$ denote the spectrograms of $y$ and $x$, respectively, and denote $N = Y - X$. The spectrogram is not linear, but heuristically, we will think of $N$ as the spectrogram of the noise.

We further assume that we have priors for the signal $x$ and the noise $n$. More precisely, if, e.g., $y$ is a noisy recording of a person talking, then we assume to have access to a clean sample $\hat{x}$ of that person talking and also a sample $\hat{n}$ of the noise, recorded in the same environment or sampled from the same probability distribution as $n$. Let $\hat{X}$ and $\hat{N}$ denote the spectrograms of $\hat{x}$ and $\hat{n}$, respectively. We take $W_S$ and $W_N$ to be dictionaries learned using the NMF approach from $\hat{X}$ and $\hat{N}$ by minimizing the loss function (1). As the columns of $W_S$ are trained to represent the pure signal, we expect that they can sparsely code $X$ but not $N$. In order to decompose $Y$ into $X + N$, we concatenate $W_S$ and $W_N$ to form $W = [W_S \; W_N]$ and obtain a coding matrix $H$ by minimizing the loss function
$$\ell(V, W, H) = \| V - WH \|_F^2 + \lambda \| H \|_1 \qquad (1)$$
with $V = Y$, for a suitably chosen regularization parameter $\lambda > 0$. This problem can be solved in a variety of ways including, e.g., the multiplicative update scheme described in Algorithm 1 (see, e.g., [3]). After solving this minimization problem, writing $H = \begin{bmatrix} H_S \\ H_N \end{bmatrix}$, we will have $X \approx W_S H_S$ and $N \approx W_N H_N$.
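For concreteness, the following is a minimal sketch of one standard multiplicative update for this sparse coding step, consistent with the loss (1) up to a rescaling of $\lambda$; the iteration count is an arbitrary illustrative choice.

```python
import numpy as np

def sparse_code(V, W, lam, n_iter=500, eps=1e-10):
    """Approximately minimize ||V - W H||_F^2 + lam * ||H||_1 over H >= 0
    via multiplicative updates (one standard scheme; see, e.g., [3])."""
    H = np.random.default_rng(0).random((W.shape[1], V.shape[1]))
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        # The l1 penalty enters the update as the constant lam in the
        # denominator (up to a rescaling of lam).
        H *= WtV / (WtW @ H + lam + eps)
    return H
```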
One of the drawbacks of this NMF-based noise reduction approach is that learning $W_S$ and $W_N$ requires the entire matrices $\hat{X}$ and $\hat{N}$ to be loaded in memory and to be available ahead of time. ONMF, on the other hand, is much more memory-efficient, as it does not require loading the entire data matrix at once. Furthermore, it does not require that $\hat{X}$ and $\hat{N}$ are known in advance, which makes it possible to use in online settings.
Our method is based on sampling columns of the training spectrogram, which we interpret as "time slices", and organizing them into "time sample" submatrices. We then iteratively update the dictionary matrix to obtain a close approximation of all the sampled submatrices. More precisely, let $T$ be some large integer and choose a batch size $b$. For $t = 1, \dots, T$, we let $X_t$ be a $d \times b$ matrix obtained by randomly selecting $b$ columns of the spectrogram. In ONMF, one iteratively learns the best factorization $X_t \approx W_t H_t$. On each iteration, one first finds $H_t$ which minimizes $\ell(X_t, W_{t-1}, H)$ and then finds $W_t$ which minimizes the average value of $\ell(X_s, W, H_s)$ over $s = 1, \dots, t$. Unfortunately, the straightforward update rule

$$W_t = \operatorname*{arg\,min}_{W \geq 0} \frac{1}{t} \sum_{s=1}^{t} \ell(X_s, W, H_s) \qquad (2)$$

requires storing all of the matrices $H_1, \dots, H_t$. This creates a considerable memory burden. In [8], the authors were able to alleviate it by aggregating all of the relevant past information from the first $t$ steps into two aggregation matrices $A_t = \sum_{s=1}^{t} H_s H_s^\top$ and $B_t = \sum_{s=1}^{t} X_s H_s^\top$. As detailed in Algorithm 2, this allows one to compute $W_t$ via multiplicative updates without storing all of the $H_s$. Moreover, the authors show that this more complicated update rule is equivalent to the intuitive update rule (2) and also provide theoretical convergence guarantees for i.i.d. data. These convergence guarantees were subsequently extended to Markovian data in [7].
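The following sketch illustrates how such an aggregated ONMF loop might look, reusing the `sparse_code` routine from the previous sketch. It is a schematic rendering in the spirit of [8] and Algorithm 2, not the paper's exact implementation; `sample_batch` is a hypothetical callable, and the hyperparameters are illustrative.

```python
import numpy as np

def onmf(sample_batch, d, r, lam, T, n_inner=50, eps=1e-10):
    """Sketch of an ONMF loop with aggregation matrices; `sample_batch(t)`
    should return a d x b batch of spectrogram columns. Uses sparse_code
    as defined in the earlier sketch."""
    rng = np.random.default_rng(0)
    W = rng.random((d, r))
    A = np.zeros((r, r))   # aggregates sum_s H_s H_s^T
    B = np.zeros((d, r))   # aggregates sum_s X_s H_s^T
    for t in range(1, T + 1):
        Xt = sample_batch(t)
        Ht = sparse_code(Xt, W, lam)      # coding step
        A += Ht @ Ht.T
        B += Xt @ Ht.T
        for _ in range(n_inner):          # dictionary step: multiplicative
            W *= B / (W @ A + eps)        # updates toward the argmin in (2)
        # Keep dictionary columns normalized, as in the experiments below.
        W /= np.maximum(np.linalg.norm(W, axis=0, keepdims=True), eps)
    return W
```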
Remark 1.
Since the matrices $X_t$ are obtained by randomly sampling the columns of the spectrogram, we naturally obtain a sequence of i.i.d. data matrices, as analyzed in [8]. However, in the setup of online audio processing, it is reasonable to model the columns of the spectrogram as a vector-valued Markov chain. In this case, if one also models the batches $X_t$ as a Markov chain (for instance, by forming each batch from consecutive columns), the convergence guarantees for Markovian data from [7] apply.
By applying the ONMF algorithm described in Algorithm 2 to $\hat{X}$ and $\hat{N}$, we obtain dictionaries $W_S$ and $W_N$, which we concatenate into $W = [W_S \; W_N]$. Then we find a sparse coding matrix $H = \begin{bmatrix} H_S \\ H_N \end{bmatrix}$ such that $Y \approx WH$ and define our estimates of the spectrograms of the signal and the noise by $\widetilde{X} = W_S H_S$ and $\widetilde{N} = W_N H_N$.

Next, we apply a post-processing step to enforce that our two estimates sum to the observed spectrogram $Y$. In particular, we set

$$\overline{X}_{ij} = \frac{\widetilde{X}_{ij}}{\widetilde{X}_{ij} + \widetilde{N}_{ij}} \, Y_{ij}$$

and define $\overline{N}$ similarly. As the experiments show, this step greatly enhances the quality of our recovered audio. The matrix $\overline{X}$ is our estimate of the magnitudes of the STFT of $x$. We estimate the phases by setting them equal to the phases of the STFT of $y$. Finally, we apply an inverse STFT to obtain an estimate of $x$.
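The sketch below assembles these post-processing and reconstruction steps, assuming `W_S`, `H_S`, `W_N`, `H_N`, and the complex STFT `Zxx` of $y$ were computed as in the earlier sketches; the STFT parameters must match the forward transform.

```python
import numpy as np
from scipy.signal import istft

eps = 1e-10
X_tilde = W_S @ H_S    # estimated signal spectrogram
N_tilde = W_N @ H_N    # estimated noise spectrogram

# Ratio mask enforcing that the two estimates sum to the observed spectrogram.
Y = np.abs(Zxx)
X_bar = X_tilde / (X_tilde + N_tilde + eps) * Y

# Combine the estimated magnitudes with the phases of the noisy STFT,
# then invert to the time domain.
Zxx_hat = X_bar * np.exp(1j * np.angle(Zxx))
_, x_hat = istft(Zxx_hat, fs=rate, nperseg=1024, noverlap=512)
```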
3 Experiments
In our experiments, we apply both ONMF and traditional NMF to the noise reduction problem and show that ONMF exhibits superior performance. Our code is available at https://github.com/Jerry-jwz/Audio-Enhancement-via-ONMF.
We consider signals corrupted by synthetically produced Gaussian white noise as well as signals corrupted by real-world noise produced by a Levoit-H132 air purifier. We evaluate the performance of a denoising method using three standard accuracy measures: the signal-to-distortion ratio (SDR), the signal-to-interference ratio (SIR), and the signal-to-artifacts ratio (SAR); we refer the reader to [15] for the definitions.

In both methods, we let the signal dictionary $W_S$ have 50 columns and the noise dictionary $W_N$ have 10 columns. In the training stage, we construct our initial dictionary by drawing each entry i.i.d. uniformly from $[0, 1]$. After each iteration, we renormalize the dictionary columns to have unit norm. In the sparse coding stage, we set the regularization parameter $\lambda$ equal to 100. In Algorithm 1, we iterate the multiplicative updates until a fixed stopping criterion is met. In Algorithm 2, at each iteration $t$ we randomly select 100 columns of the training spectrogram to form $X_t$.

In Tables 1 and 2, we summarize the SDR, SIR, and SAR of the original noisy signal as well as of the signals obtained via the NMF- and ONMF-based denoising approaches. These quantities decrease as the norm of the residual noise grows, so larger values indicate a better quality of denoising.
Table 1: Denoising results for a signal corrupted by Gaussian white noise.

Method | SDR | SIR | SAR
---|---|---|---
NMF | 19.43 | 31.42 | 19.72
ONMF | 22.70 | 53.45 | 22.70
ORIGINAL | 9.75 | 9.76 | 37.41
Table 2: Denoising results for a signal corrupted by real-world (air purifier) noise.

Method | SDR | SIR | SAR
---|---|---|---
NMF | 9.46 | 13.90 | 11.63
ONMF | 10.41 | 13.11 | 13.95
ORIGINAL | 5.91 | 5.91 | 286.50
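The metrics reported in Tables 1 and 2 can be computed, for instance, with the mir_eval package; the paper does not specify an implementation, so this choice, and the variable names below (`x`, `n` for the clean signal and noise, `x_hat`, `n_hat` for the estimates), are our assumptions.

```python
import numpy as np
import mir_eval

# One way to compute SDR, SIR, and SAR as defined in [15].
L = min(len(x), len(x_hat))
references = np.stack([x[:L], n[:L]])
estimates = np.stack([x_hat[:L], n_hat[:L]])
sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(references, estimates)
print(f"SDR={sdr[0]:.2f}  SIR={sir[0]:.2f}  SAR={sar[0]:.2f}")
```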
Figure 1 qualitatively illustrates the performance of NMF- and ONMF-based denoising algorithms. It shows a plot of the original clean and noisy signal spectrograms, as well as those reconstructed by the two algorithms from a signal corrupted by white noise. We observe both qualitatively and quantitatively that our reconstruction algorithm based on ONMF outperforms the one based on traditional NMF.
[Figure 1: Spectrograms of the original clean signal, the noisy signal, and the reconstructions produced by the NMF- and ONMF-based algorithms.]
Between the two experiments, we see that the ONMF-based method outperforms its NMF-based counterpart on five out of six metrics and is only slightly worse with respect to SIR in the case of real-world noise. For both methods, we observe that the algorithm appears to introduce artifacts: the SAR decreases during the denoising process (quite significantly in the case of real-world noise).
We also investigate the role of the regularization parameter $\lambda$. In Table 3, we report the SIR for different values of $\lambda$ while using 50 columns in the signal dictionary and 10 columns in the noise dictionary. Similar experiments with differing numbers of columns, and with SDR and SAR in place of SIR, indicate essentially the same dependence.
Table 3: SIR as a function of the regularization parameter $\lambda$.

$\lambda$ | 50 | 60 | 70 | 80 | 90
---|---|---|---|---|---
SIR | 52.71 | 59.54 | 80.77 | 58.96 | 53.70
These results suggest that the accuracy of denoising improves as $\lambda$ increases up to a certain threshold value, after which the accuracy deteriorates. This is because we need our code $H$ to have enough nonzero terms that $WH$ can be used to approximate $X$, but not so many that it can also be used to approximate $N$.
4 Conclusion
In this paper, we have proposed a novel method of noise reduction based on online non-negative matrix factorization. We have shown, via both quantitative metrics and qualitative comparisons, that this method numerically exhibits superior performance to methods based on traditional NMF. It is also more memory-efficient and can be tailored to perform online denoising of streaming speech and music. In the future, one might hope to build upon this research by using more sophisticated loss functions than the one considered here. For instance, one might use the Kullback-Leibler divergence in place of the Frobenius norm, or use carefully crafted regularizers to enforce a certain time-frequency structure on the atoms.
References
- [1] Haichuan Bai, Fengpei Ge, and Yonghong Yan. DNN-based speech enhancement using soft audible noise masking for wind noise reduction. China Communications, 15(9):235–243, 2018.
- [2] Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6964–6968. IEEE, 2014.
- [3] Cédric Févotte, Emmanuel Vincent, and Alexey Ozerov. Single-channel audio source separation with NMF: divergences, constraints and algorithms. Audio Source Separation, pages 1–24, 2018.
- [4] Davi Miara Kiapuchinski, Carlos Raimundo Erig Lima, and Celso Antônio Alves Kaestner. Spectral noise gate technique applied to birdsong preprocessing on embedded unit. In 2012 IEEE International Symposium on Multimedia, pages 24–27, 2012.
- [5] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.
- [6] Hanbaek Lyu, Georg Menz, Deanna Needell, and Christopher Strohmeier. Applications of online nonnegative matrix factorization to image and time-series data. In 2020 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2020.
- [7] Hanbaek Lyu, Deanna Needell, and Laura Balzano. Online matrix factorization for Markovian data and applications to network dictionary learning. Journal of Machine Learning Research, 21(251):1–49, 2020.
- [8] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(1), 2010.
- [9] Pablo Navarrete and Javier Ruiz-del Solar. Analysis and comparison of eigenspace-based face recognition approaches. International Journal of Pattern Recognition and Artificial Intelligence, 16(07):817–830, 2002.
- [10] Mikkel N. Schmidt, Jan Larsen, and Fu-Tien Hsiao. Wind noise reduction using non-negative sparse coding. In 2007 IEEE Workshop on Machine Learning for Signal Processing, pages 431–436, 2007.
- [11] Mikkel N. Schmidt and Rasmus Kongsgaard Olsson. Single-channel speech separation using sparse non-negative matrix factorization. In Interspeech, volume 2, pages 2–5. Citeseer, 2006.
- [12] Sören Schulze and Emily J King. Sparse pursuit and dictionary learning for blind source separation in polyphonic music recordings. EURASIP Journal on Audio, Speech, and Music Processing, 2021(1):1–25, 2021.
- [13] Kazuki Shimada, Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, and Tatsuya Kawahara. Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(5):960–971, 2019.
- [14] L. M. Surhone, M. T. Timpledon, and S. F. Marseken. Wiener Filter: Norbert Wiener, Noise, Andrey Kolmogorov, Frequency Response, Stochastic Process, Cross-correlation, Deconvolution, Wiener Deconvolution, Expected Value, Quantization Error. Betascript Publishing, 2010.
- [15] E. Vincent, R. Gribonval, and C. Fevotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.
- [16] Kevin W. Wilson, Bhiksha Raj, Paris Smaragdis, and Ajay Divakaran. Speech denoising using nonnegative matrix factorization with priors. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4029–4032, 2008.