I Introduction
The modeling of speech has important applications in speech analysis [1], speaker verification [2], speech synthesis[3], etc. Based on the source-filter model, speech is modelled as being produced by a pulse train or white noise for voiced or unvoiced speech, which is further filtered by the speech production filter (SPF) that consists of the vocal tract and lip radiation.
All-pole modeling with a least squares cost function performs well for white noise and low pitch excitation. However, for high pitch excitation, it leads to an all-pole filter with poles close to the unit circle, and the estimated spectrum has a sharper contour than desired [4, 5]. To obtain a robust linear prediction (LP), the Itakura-Saito error criterion [6], the all-pole modeling with a distortionless response at frequencies of harmonics [4], the regularized LP [7] and the short-time energy weighted LP [8] were proposed. Motivated by the compressive sensing research, a least 1-norm criterion is proposed for voiced speech analysis [9], where sparse priors on both the excitation signals and prediction coefficients are utilized. Fast methods and the stability of the 1-norm cost function for spectral envelope estimation are further investigated in [10, 11]. More recently, in [12]
, the excitation signal of speech is formulated as a combination of block sparse and white noise components to capture the block sparse or white noise excitation separately or simultaneously. An expectation-maximization (EM) algorithm is used to reconstruct the block sparse excitation within a sparse Bayesian learning (SBL) framework
[13].A problem with the all-pole model is that some sounds containing spectral zeros with voiced excitation, such as nasals, or laterals, are poorly estimated by an all-pole model but trivial with a pole-zero (PZ) model [14, 15]. The estimation of the coefficients of the pole-zero model can be obtained separately [16], jointly [17] or iteratively [18]. A 2-norm minimization criterion with Gaussian residuals assumption is commonly used. Frequency domain fitting methods based on a similarity measure is also proposed. Motivated by the logarithmic scale perception of the human auditory system, the logarithmic magnitude function minimization criterion has been proposed [19, 15]. In [19], the nonlinear logarithm cost function is solved by transforming it into a weighted least squares problem. The Gauss-Newton and Quasi-Newton methods for solving it are further investigated in [15]. To consider both the voiced excitation and the PZ model, a speech analysis method based on the PZ model with sparse excitation in noisy conditions is presented [20]. A least 1-norm criterion is used for the coefficient estimation, and sparse deconvolution is applied for deriving sparse residuals.
In this paper, we propose a speech analysis method based on the PZ model with mixed excitation. Using the mixed excitation and PZ modeling together, we combine the advantages of non-sparse and sparse algorithms, and obtain a better fitting for both the excitation and SPF spectrum. Using the PZ model, instead of the all-pole model, a better spectral fitting can be expected. Moreover, we model both the voiced, the unvoiced excitation or a mixture of them by the mixed excitation. Additionally, block sparsity is imposed on the voiced excitation component, motivated by the quasi-periodic and temporal-correlated nature of the glottal excitation [21, 12]
. The posterior probability density functions (PDFs) for the sparse excitation and hyper-parameters, as well as point estimates of the PZ model parameters are obtained using the VEM method.
Ii Signal models
Consider the following general speech observation model:
(1) |
where is the observation signal and denotes the noise. We assume that the clean speech signal is produced by the PZ speech production model, i.e.,
(2) |
where and are the modeling coefficients of the PZ model with , is a sparse excitation corresponding to the voiced part and is the non-sparse Gaussian excitation component corresponding to the unvoiced part. Assuming and considering one frame of speech signals of samples, (1) and (2) can be written in matrix forms as
(3) | ||||
(4) |
where and are the lower triangular Toeplitz matrices with and as the first columns, respectively. The block sparse residuals are defined as , and , , and are defined similarly to . When ,
reduces to the identity matrix and (
4) becomes the all-pole model. Combining (3) and (4), the noisy observation can be written as(5) |
In [20], we assumed that only the sparse excitation was present (, but ). The sparse residuals and model parameters were estimated iteratively. The sparse residuals were obtained by solving
(6) |
where
is a constant proportional to the variance of the noise. The model parameters was estimated using the
norm of the residuals as the cost function (see [20] for details).Iii Proposed Variational EM method
We now proceed to consider the noise-free scenario but with mixed excitation (, but ). We consider the pole-zero model parameters and to be deterministic but unknown. Utilizing the SBL [13] methodology, we first express the hierarchical form of the model as
(7) |
where is the number of blocks, , is the Kronecker product, is the block size, , denotes the multivariate normal PDF and
is the Gamma PDF. The hyperparameter
is the precision of the block, and when it is infinite, the block will be zero. Note that it is trivial to extend the proposed method to any . Moreover, when , each element in is inferred independently. Here, block sparsity model is used to take the quasi-periodic and temporal-correlated nature of the voiced excitation into account. The is used for capturing the white noise excitation from unvoiced speech frame or a mixture of phonations.Our objective is to obtain the posterior densities of , and , and point estimates of the model parameters in and . First, we write the complete likelihood, i.e.,
(8) |
where we used when . Instead of finding the joint posterior density , which is intractable, we adopt the variational approximation [22]. Assume that is approximated by the density , which may be fully factorized as
(9) |
where the factors are found using an EM-like algorithm [22].
In the E-step of the VEM method, we fix the model parameters and , and re-formulate the posterior PDFs estimation problem as maximizing the variational lower bound
(10) |
where is the shorthand for , is defined as , and denotes the expectation of
w.r.t. the random variable
(i.e., ). Substituting (III) and (9) into (10), and following the derivation from [22], we obtain(11) |
It is clearly seen that in (III) is a Gaussian PDF, i.e.,
(12) |
where and . We also define the auto-correlation matrix . The posterior PDF of in (III) is a Gamma probability density, i.e.,
(13) |
where , and denotes the element of . The expectation of the precision matrix is . Similar to , the posterior PDF of is
(14) |
where , . The expectation of can be expressed as .
In the M-step, we maximize the lower bound (10) w.r.t. the modeling parameters and , respectively. The optimization problems can be shown to be equivalent to , respectively. To obtain the estimate for , we first note that can be expressed as , where is a Toeplitz matrix of the form
Using this expression and obtained in the E-step, the minimization problem w.r.t. can be re-formulated as
(15) |
As can be seen, (III) is the standard least squares problem and has the analytical solution as
(16) |
We can obtain the solution of , like , by setting the derivative of w.r.t. to zero, i.e.,
(17) |
where is an lower triangular Toeplitz matrix of the form
From (III), we obtain the estimate of , i.e.,
(18) |
where is an symmetric matrix with the element given by . The can be obtained by simply replacing the stochastic variable in with the mean estimate (the element in ). The is an vector with the element given by . Note that the estimation of in (18) requires the knowledge of and vice versa (see (16)). This coupling is solved by replacing them with their estimates from previous iteration. The algorithm is initialized with , , and , and starts with the update of . We refer to the proposed variational expectation maximization pole-zero estimation algorithm as the VEM-PZ.
Iv Results
In this section, we compare the performance of the proposed VEM-PZ, the two-stage least squares pole-zero (TS-LS-PZ) method [14], 2-norm linear prediction (2-norm LP) [1], 1-norm linear prediction (1-norm LP)[9] and expectation maximization based linear prediction (EM-LP) for mixed excitation [12] in both synthetic and real speech signals analysis scenarios.
Iv-a Synthetic signal analysis
F0 | 200 | 250 | 300 | 350 | 400 |
---|---|---|---|---|---|
2-norm LP | 1.79 | 2.14 | 2.12 | 2.53 | 2.13 |
TS-LS-PZ | 2.41 | 4.77 | 1.88 | 1.46 | 2.86 |
1-norm LP | 2.43 | 3.15 | 3.60 | 3.29 | 4.29 |
EM-LP | 5.62 | 6.68 | 4.68 | 3.96 | 4.83 |
VEM-PZ, D=1 | 4.50 | 7.14 | 2.29 | 1.54 | 2.31 |
VEM-PZ, D=5 | 1.55 | 4.47 | 0.69 | 2.01 | 4.50 |
VEM-PZ, D=7 | 2.08 | 4.07 | 2.18 | 1.41 | 1.29 |
VEM-PZ, D=8 | 0.77 | 5.56 | 2.52 | 4.86 | 0.53 |
We first examine the performance of the VEM-PZ with different block size and compares it with traditional algorithms using synthetic voiced consonant /n/, as shown in Fig. 1 and Fig. 2. The synthetic signals are generated by convolving an artificial glottal source waveform with a constructed filter. The first derivative of the glottal flow pulse is simulated with the Liljencrants-Fant (LF) waveform [23] with the modal phonation mode, whose parameter is taken from Gobl [24]. The voiced alveolar nasal consonant /n/ is synthesized at 8 kHz sampling frequency with the constructed filter having two formant frequencies (bandwidths) of 257 Hz (32 Hz) and 1891 Hz (100 Hz) and one antiformant of 1223 Hz (52 Hz) [25]. is set to 240 (30 ms), and are used for all the experiments. The power ratio of the block sparse excitation and Gaussian component is set to 30 dB, and the pitch is set to 200 Hz. The and are set to 5 for the pole-zero modeling methods (i.e., VEM-PZ and TS-LS-PZ), but is used for the all-pole modeling methods (i.e., 2-norm LP, 1-norm LP and EM-LP). In Fig. 1, the means of the residuals of EM-LP and VEM-PZ are plotted. Note that the residuals of the TS-LS-PZ, 1-norm LP and EM-LP are prepended with zeros due to the covariance-based estimation methods. As can be seen in Fig. 1, the residual estimate of the proposed VEM-PZ with is closest to the true block sparse excitation. Moreover, when , the residuals of the VEM-PZ are the sparsest as expected. Although the EM-LP also produces block sparse residuals compared with the 1-norm LP, the proposed method achieves the best approximation to the true one due to the usage of the pole-zero modeling and the block-sparse motivated SBL method. The corresponding spectral estimates are shown in Fig. 2. First, as can be seen, the VEM-PZ with has a smooth power spectrum due to the sparse residuals in Fig. 1. Second, the 1-norm LP tends to produce a better estimate of the formants than the 2-norm LP and TS-LS-PZ. Third, although the EM-LP has two peaks around the first formant, it performs well for second formant estimation. Finally, the proposed VEM-PZ with has good performance for both formant, antiformant and bandwidth estimates because of the block sparse excitation and the pole-zero model.
Second, the spectral distortion is tested under different fundamental frequencies and block sizes. The measure is defined as the truncated power cepstral distance [26], i.e.,
(19) |
where and are the true and estimated power cepstral coefficients, respectively. Cepstral coefficients are computed by first reflecting all the poles and zeros to the inside of the unit circle and then using the recursive relation in [25]. In our experiments, we set . The fundamental frequency rises from 200 to 400 Hz in steps of 50 Hz. The experimental results in TABLE LABEL:tab:sd_table1 are obtained by the ensemble averages over 500 Monte Carlo experiments. is used for the EM-LP [12]. As can be seen from TABLE LABEL:tab:sd_table1, the 2-norm LP has a lower spectral distortion than the 1-norm LP, EM-LP and TS-LS-PZ (except for 300 and 350 Hz). The proposed VEM-PZ achieves the lowest spectral distortion for 200, 300, 350 and 400 Hz. However, note that the good performance of the VEM-PZ depends on a good choice of the block sizes for different fundamental frequencies, and there is a fluctuation when the frequency changes. This is because the length of correlated samples changes with the fundamental frequency. But, as can be seen from Fig. 2 and from our experience, the VEM-PZ usually produces better formant, antiformant and bandwidth estimates than traditional ones.
Iv-B Speech signal analysis
We examine the residuals and spectral estimate of the VEM-PZ for a nasal consonant /n/ in the word ”manage” from the CMU Arctic database [27, 28], pronounced by an US female native speaker, sampled of 8000 Hz. The results are shown in Fig. 3 and Fig. 4. To improve the modeling flexibility, the and are set to 10 for the PZ methods (i.e., VEM-PZ and TS-LS-PZ), but is still used for the all-pole methods (i.e., 2-norm LP, 1-norm LP and EM-LP). As can be seen from Fig. 3, the residuals of the EM-LP and 1-norm LP are sparser than the 2-norm LP. The residuals of the proposed VEM-PZ with are the sparsest. The proposed VEM-PZ with is block sparse and is sparser than the TS-LS-PZ. From Fig. 4, we can see that all the algorithms have formant estimates around 150 Hz. However, the TS-LS-PZ, 1-norm LP and VEM-PZ with have very peaky behaviour. Also, the 2-norm LP produces bad bandwidth estimates around 2000 and 2900 Hz. Furthermore, compared to the EM-LP, the proposed VEM-PZ with D=8 has good antiformant estimates around 500 and 1500 Hz.
V Conclusion
A variational expectation maximization pole-zero speech analysis method has been proposed. By using the pole-zero model, it can fit the spectral zeros of speech signals easily. Moreover, block sparse residuals are encouraged by applying the sparse Bayesian learning method. By iteratively updating parameters and statistics of residuals and hyperparameters, block sparse residuals can be obtained. The good performance has been verified by both synthetic and real speech experiments. The proposed method is promising for speech analysis applications. Next, further research into the formant, antiformant and bandwidth estimation accuracy, stability, and unknown sparse pattern should be conducted.
References
- [1] J. Makhoul, “Linear prediction: A tutorial review,” Proc. IEEE, vol. 63, no. 4, pp. 561–580, 1975.
- [2] J. Pohjalainen, C. Hanilci, T. Kinnunen, and P. Alku, “Mixture linear prediction in speaker verification under vocal effort mismatch,” IEEE Signal Process. Lett., vol. 21, no. 12, pp. 1516–1520, dec 2014.
- [3] D. Erro, I. Sainz, E. Navas, and I. Hernaez, “Harmonics plus noise model based vocoder for statistical parametric speech synthesis,” IEEE J. Sel. Topics Signal Process., vol. 8, no. 2, pp. 184–194, 2014.
- [4] M. N. Murthi and B. D. Rao, “All-pole modeling of speech based on the minimum variance distortionless response spectrum,” IEEE Trans. Speech Audio Process., vol. 8, no. 3, pp. 221–239, 2000.
- [5] T. Drugman and Y. Stylianou, “Fast inter-harmonic reconstruction for spectral envelope estimation in high-pitched voices,” IEEE Signal Process. Lett., vol. 21, no. 11, pp. 1418–1422, 2014.
- [6] A. El-Jaroudi and J. Makhoul, “Discrete all-pole modeling,” IEEE Trans. Signal Process., vol. 39, no. 2, pp. 411–423, 1991.
- [7] L. A. Ekman, W. B. Kleijn, and M. N. Murthi, “Regularized linear prediction of speech,” vol. 16, no. 1, pp. 65–73, 2008.
- [8] P. Alku, J. Pohjalainen, M. Vainio, A. M. Laukkanen, and B. H. Story, “Formant frequency estimation of high-pitched vowels using weighted linear prediction.” J. Acoust. Soc. Am., vol. 134, no. 2, pp. 1295–1313, Auguest 2013.
- [9] D. Giacobello, M. G. Christensen, M. N. Murthi, S. H. Jensen, and M. Moonen, “Sparse linear prediction and its applications to speech processing,” vol. 20, no. 5, pp. 1644–1657, jul 2012.
- [10] D. Giacobello, M. G. Christensen, T. L. Jensen, M. N. Murthi, S. H. Jensen, and M. Moonen, “Stable 1-norm error minimization based linear predictors for speech modeling,” IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 22, no. 5, pp. 912–922, may 2014.
- [11] T. L. Jensen, D. Giacobello, T. V. Waterschoot, and M. G. Christensen, “Fast algorithms for high-order sparse linear prediction with applications to speech processing,” Speech Commun., vol. 76, pp. 143–156, 2016.
- [12] R. Giri and B. D. Rao, “Block sparse excitation based all-pole modeling of speech,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. IEEE, 2014.
- [13] M. E. Tipping, “Sparse bayesian learning and the relevance vector machine,” J. Mach. Learn. Res., vol. 2001, pp. 211– 244, 2001.
- [14] P. Stoica and R. Moses, Spectral Analysis of Signals. New Jersey: Prentice Hall, Upper Saddle River, 2004.
- [15] D. Marelli and P. Balazs, “On pole-zero model estimation methods minimizing a logarithmic criterion for speech analysis,” vol. 18, no. 2, pp. 237–248, 2010.
- [16] J. Durbin, “The fitting of time-series models,” Rev. Int’l Statistical Inst., vol. 28, no. 3, pp. 233–244, 1960.
- [17] E. Levy, “Complex-curve fitting,” IRE Trans. Automat. Contr., vol. AC-4, no. 1, pp. 37–43, 1959.
- [18] K. SteTiange, “On the simultaneous estimation of poles and zeros in speech analysis,” IEEE Trans. Acoust., Speech, Signal Process., vol. 25, no. 3, pp. 229–234, 1977.
- [19] T. Kobayashi and S. Imai, “Design of IIR digital filters with arbitrary log magnitude function by WLS techniques,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 2, pp. 247–252, 1990.
- [20] L. Shi, J. R. Jensen, and M. G. Christensen, “Least 1-norm pole-zero modeling with sparse deconvolution for speech analysis,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 2017, to be published.
- [21] P. Alku, “Glottal inverse filtering analysis of human voice production,” Sadhana - Academy Proceedings in Engineering Sciences, vol. 36, 2011.
- [22] C. M. Bishiop, Pattern recognition and machine learning, M. Jordan, J. Kleinberg, and B. Scho, Eds. Springer, 2006.
- [23] G. Fant, J. Liljencrants, and Q. Lin, “A four-parameter model of glottal flow,” STL-QPSR4 (Speech, Music and Hearing, Royal Institude of Technology, Stochholm, Sweden), pp. 1–13, 1985.
- [24] C. Gobl, “A preliminary study of acoustic voice quality correlates,” STL-QPSR4 (Speech, Music and Hearing, Royal Institude of Technology, Stochholm, Sweden), pp. 9– 22, 1989.
- [25] D. D. Mehta, D. Rudoy, and P. J. Wolfe, “Kalman-based autoregressive moving average modeling and inference for formant and antiformant tracking,” J. Acoust. Soc. Am., vol. 132, no. 3, pp. 1732–1746, Sep. 2012.
- [26] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, 1993.
- [27] J. Kominek and A. Black, “The CMU Arctic speech databases,” in 5th ISCA Speech Synthesis Workshop. Pittsburgh, PA, 2004.
- [28] CMU-ARCTIC Speech Synthesis Databases. [Online]. Available: http://festvox.org/cmu_arctic/index.html
Comments
There are no comments yet.