1 Introduction
Time-frequency (TF) analysis is a foundation of audio and speech signal processing. The short-time Fourier transform (STFT) is a widely used tool, which can be implemented efficiently with the FFT [1]. The STFT offers a straightforward interpretation of a signal: it provides uniform time and frequency resolution with linearly spaced TF bins. The corresponding theory was generalized in the framework of Gabor analysis and Gabor frames [2, 3, 4].

Signal synthesis is an important application area of time-frequency transforms. Signal modification, denoising, separation and so on can be achieved by manipulating the analysis coefficients before synthesizing a desired signal. The theory of Gabor multipliers [5] or, in general terms, frame multipliers [6, 7] provides a basis for the stability and invertibility of such operations. A frame multiplier is an operator that converts one signal into another by pointwise multiplication in the transform domain before resynthesis. The sequence of multiplication coefficients is called a frame mask (or symbol). Such operators allow easy implementation of time-varying filters [8]. They have been used in perceptual sparsity [9], denoising [10] and signal synthesis [11].
Algorithms to estimate the frame mask between audio signals were investigated in [11, 12], where it was demonstrated that the frame mask between two instrumental sounds (of the same note) is an effective measure to characterize timbre variations between the instruments. Such masks were used for timbre morphing and instrument categorization. In this paradigm, the two signals share the same fundamental frequency and their harmonics are naturally aligned, which vouched for the prominence of the obtained mask for TF analysis/synthesis with uniform resolution.

This study extends the frame mask method to speech signals. One intrinsic property of (voiced) speech is that the fundamental frequency (or pitch) varies continuously over time. Therefore, the harmonic structures are not well aligned when comparing two signals. We propose to employ the nonstationary Gabor transform (NSGT) [13] to tackle this issue. NSGT provides flexible time-frequency resolution by incorporating dynamic time/frequency hopsizes and dynamic analysis windows [13, 14, 15]. We develop an NSGT whose frequency resolution changes over time: we set the frequency hopsize in ratio to the fundamental frequency to achieve harmonic alignment (or partial alignment, cf. Section 4) in the transform domain. On this basis, we propose the harmonic-aligned frame mask. To demonstrate feasibility in speech, we evaluate the proposal in the context of vowel-dependent speaker comparison. Frame masks between voiced signals of the same vowel but pronounced by different speakers are proposed as similarity measures for speaker characteristics, to distinguish speaker identities in a limited-data scenario (cf. Section 5 for details).
This paper is organized as follows. In Section 2, we briefly review frame and Gabor theory. In Section 3, we elaborate on the frame mask and its previous application in instrumental sound analysis. In Section 4, we develop the nonstationary Gabor transform with pitch-dependent frequency resolution and propose the harmonic-aligned frame mask. Section 5 presents the evaluation in vowel-dependent speaker identification. Finally, Section 6 concludes this study.
2 Preliminaries and Notation
2.1 Frame Theory
Denote by $\{g_k\}_{k\in K}$ a sequence of signal atoms in the Hilbert space $\mathcal{H}$, where $K$ is an index set. This atom sequence is a frame [3] if and only if there exist constants $A$ and $B$, $0 < A \le B < \infty$, such that
$$A\,\|f\|^2 \;\le\; \sum_{k\in K} |\langle f, g_k\rangle|^2 \;\le\; B\,\|f\|^2, \qquad \forall f \in \mathcal{H}, \quad (1)$$
where $\{\langle f, g_k\rangle\}_{k\in K}$ are the analysis coefficients. $A$ and $B$ are called the lower and upper frame bounds, respectively. The frame operator $\mathbf{S}$ is defined by $\mathbf{S}f = \sum_{k\in K} \langle f, g_k\rangle\, g_k$.
Given the canonical dual frame $\{\tilde{g}_k\}_{k\in K} = \{\mathbf{S}^{-1} g_k\}_{k\in K}$ of $\{g_k\}_{k\in K}$, $f$ can be perfectly reconstructed from the analysis coefficients by
$$f = \sum_{k\in K} \langle f, g_k\rangle\, \tilde{g}_k. \quad (2)$$
The dual frame always exists [16], and for redundant cases there are infinitely many other duals allowing reconstruction.
2.2 Discrete Gabor Transform
We take the Hilbert space to be $\mathbb{C}^L$. Given a nonzero prototype window $g \in \mathbb{C}^L$, the translation operator $\mathbf{T}_n$ and modulation operator $\mathbf{M}_m$ are, respectively, defined as
$$(\mathbf{T}_n g)[l] = g[l-n] \quad \text{and} \quad (\mathbf{M}_m g)[l] = g[l]\, e^{2\pi i m l / L},$$
where $l = 0, \dots, L-1$ and the translation is performed modulo $L$. For selected constants $a, b \in \mathbb{N}$, with some $N, M \in \mathbb{N}$ such that $L = Na = Mb$, we take $\Lambda = \{(na, mb)\}_{n=0,\dots,N-1,\,m=0,\dots,M-1}$ to be a regular discrete lattice, and obtain the Gabor system [2] as
$$\mathcal{G}(g, a, b) = \{g_{m,n} = \mathbf{M}_{mb}\mathbf{T}_{na}\, g\}_{m=0,\dots,M-1,\; n=0,\dots,N-1}. \quad (3)$$
If $\mathcal{G}(g, a, b)$ satisfies (1) for all $f \in \mathbb{C}^L$, it is called a Gabor frame [17]. The discrete Gabor transform (DGT) of $f$ is an $M \times N$ matrix $\mathbf{C}$ with entries $c_{m,n} = \langle f, g_{m,n}\rangle$. The associated frame operator reads
$$\mathbf{S}f = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} \langle f, g_{m,n}\rangle\, g_{m,n}. \quad (4)$$
The canonical dual frame of the Gabor frame is given by $\tilde{g}_{m,n} = \mathbf{M}_{mb}\mathbf{T}_{na}\,\mathbf{S}^{-1}g$ [18], with which $f$ can be perfectly reconstructed by
$$f = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} c_{m,n}\, \tilde{g}_{m,n}.$$
Note that the DGT coefficients are essentially sampling points of the STFT of $f$ with window $g$ at the time-frequency points $(na, mb)$, with $a$ and $b$ being the sampling steps (i.e., hopsizes) in time and frequency [18]. In nonstationary settings, the hopsizes are allowed to vary (cf. Section 4).
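To make the sampled-STFT view concrete, the DGT can be sketched in a few lines of NumPy. This is our own toy implementation (the function name `dgt` and the divisibility assumptions $a \mid L$, $M \mid L$ are ours); the LTFAT toolbox [1] provides efficient, tested routines for real use.

```python
import numpy as np

def dgt(f, g, a, M):
    """Toy discrete Gabor transform: sample the STFT of f with window g
    on the lattice (n*a, m*b), b = L/M.  Assumes a | L and M | L."""
    L = len(f)
    assert L % a == 0 and L % M == 0, "sketch assumes a | L and M | L"
    N = L // a
    C = np.zeros((M, N), dtype=complex)
    for n in range(N):
        gn = np.roll(g, n * a)       # T_na g (translation modulo L)
        prod = f * np.conj(gn)       # f[l] * conj(g[l - n*a])
        # <f, M_mb T_na g> = sum_l prod[l] e^{-2 pi i m l / M}:
        # fold prod into M bins, then take one M-point FFT
        C[:, n] = np.fft.fft(prod.reshape(L // M, M).sum(axis=0))
    return C
```

The fold-then-FFT step is the standard trick for evaluating all $M$ modulations of one window position with a single short FFT.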
3 Frame mask for instrumental sound analysis
3.1 Frame Mask
Consider a pair of frames $\{g_k\}_{k\in K}$ and $\{h_k\}_{k\in K}$. A frame multiplier [19], denoted by $\mathbf{M}_{\mathbf{m}}$, is an operator that acts on a signal by pointwise multiplication in the transform domain. The symbol $\mathbf{m} = \{m_k\}_{k\in K}$ is a sequence that denotes the multiplication coefficients. For a signal $f$,
$$\mathbf{M}_{\mathbf{m}} f = \sum_{k\in K} m_k\, \langle f, g_k\rangle\, h_k. \quad (5)$$
Here $\mathbf{m}$ is called a frame mask. In the considered signal analysis/transform domain, $\mathbf{m}$ can be viewed as a transfer function.
When Gabor frames $\{g_{m,n}\}$ and $\{h_{m,n}\}$ are considered, we set $k = (m, n)$. In this case the frame multiplier in (5) is known as a Gabor multiplier. The corresponding frame mask is also known as a Gabor mask.
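A Gabor multiplier can be sketched end-to-end: analyze with $g$, scale each coefficient by the mask, and resynthesize with the canonical dual window. The sketch below assumes the painless case (window support at most $M$), where the frame operator is diagonal and the dual window is a pointwise division; it is an illustration, not LTFAT's implementation.

```python
import numpy as np

def gabor_multiplier(f, g, a, M, mask):
    """Apply a Gabor multiplier with symbol `mask` (shape M x N):
    analysis with g, pointwise multiplication, synthesis with the
    canonical dual window.  Painless case assumed (supp(g) <= M)."""
    L = len(f)
    N = L // a
    # painless-case diagonal frame operator: S[l] = M * sum_n |g[l - n*a]|^2
    diag = M * sum(np.roll(np.abs(g) ** 2, n * a) for n in range(N))
    gd = g / diag                    # canonical dual window S^{-1} g
    # mods[m, l] = e^{2 pi i m b l / L} with b = L/M
    mods = np.exp(2j * np.pi * np.outer(np.arange(M), np.arange(L)) / M)
    out = np.zeros(L, dtype=complex)
    for n in range(N):
        gn = np.roll(g, n * a)
        # analysis: <f, M_mb T_na g> for all m via folding + FFT
        coef = np.fft.fft((f * np.conj(gn)).reshape(L // M, M).sum(axis=0))
        coef = coef * mask[:, n]     # pointwise multiplication by the symbol
        # synthesis with the dual atoms M_mb T_na gd
        out += (coef @ mods) * np.roll(gd, n * a)
    return out
```

With an all-ones mask the multiplier reduces to analysis followed by dual synthesis, i.e., perfect reconstruction.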
3.2 For Instrument Timbre Analysis and Conversion
The application of frame masks in musical signals was investigated in [11, 12]. Based on DGT, the proposed signal model converts one sound into another by
$$f_2 = \mathbf{M}_{\mathbf{m}}\, f_1, \quad (6)$$
where $f_1, f_2$ are two audio signals and $\mathbf{m}$ is the unknown mask to be estimated. An obvious solution is to set $m_{m,n} = c^{(2)}_{m,n} / c^{(1)}_{m,n}$, where $c^{(1)}_{m,n}$ and $c^{(2)}_{m,n}$ are the DGT coefficients of $f_1$ and $f_2$, respectively. However, this solution is unstable and unbounded, as the DGT coefficients in the denominator can be zero or very small. To guarantee the existence of a stable solution, it was proposed to estimate the mask via
$$\hat{\mathbf{m}} = \arg\min_{\mathbf{m}} \|\mathbf{M}_{\mathbf{m}} f_1 - f_2\|_2^2 + \lambda\, d(\mathbf{m}), \quad (7)$$
with a (convex) regularization term $d(\mathbf{m})$, whose influence is controlled by the parameter $\lambda$ [12]. As the existence of a stable solution is assured, such an approach can in general be applied to an arbitrary pair of signals. However, it might be difficult to interpret the estimated masks (e.g., the mask between two pure-tone signals with different fundamental frequencies).
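The instability of the naive ratio, and the stabilizing effect of the regularization, can be seen on made-up coefficients. The entrywise $\ell_2$-regularized form below is one simple instance of the idea behind (7), not the only choice considered in [12], and the coefficient values are hypothetical.

```python
import numpy as np

# Hypothetical DGT coefficients of a source and a target signal.
c1 = np.array([1.0, 0.5, 1e-9, 0.8])   # source; note the near-zero entry
c2 = np.array([0.9, 0.6, 0.1, 0.7])    # target
naive = c2 / c1                         # blows up where c1 is near zero
lam = 1e-3                              # regularization weight (lambda)
# Entrywise l2-regularized estimate: bounded for any lam > 0.
reg = np.conj(c1) * c2 / (np.abs(c1) ** 2 + lam)
```

Wherever $|c^{(1)}|$ is large the regularized estimate approaches the ratio; wherever it vanishes the estimate decays to zero instead of diverging.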
Given that $f_1$ and $f_2$ are of the same note produced by different instruments, the frame mask between the two signals was found to effectively characterize the timbre difference between the two instruments [11, 12]. Such masks were utilized as similarity measures for instrument classification and for timbre morphing and conversion. The rationale of these applications rests on two aspects:

1) Instrumental signals of the same note possess the same fundamental frequency; the harmonic structures of the signals are naturally aligned.

2) The DGT performs TF analysis over a regular TF lattice, and consequently preserves the harmonic alignment in the transform domain.
4 Frame mask for speech signals using the nonstationary Gabor transform
Similar to the sounds of instrument notes, (voiced) speech signals are also harmonic signals. Analogous to the above-mentioned applications, this study explores the application of frame masks to speech signals. In particular, we consider using voiced speech as source and target signals and estimating the frame mask between them. We are especially interested in the case where the source and the target are of the same content, e.g., the same vowel. In such a case, a valid frame mask could measure specific variations among the signals, such as speaker variations.
Nevertheless, attempting to use (7) for speech signals, we immediately face a fundamental problem. For speech signals, the fundamental frequency usually varies continuously over time. Therefore, the harmonic structures of the source and target voices are mostly not aligned. To address this problem, we propose to employ the nonstationary Gabor transform, which allows flexible time-frequency resolution [13]. Within the framework of nonstationary Gabor analysis, we intend to achieve dynamic alignment of the signals' harmonic structures. In the following, we develop an NSGT with pitch-dependent frequency resolution to achieve harmonic alignment in the transform domain, and on that basis propose the harmonic-aligned frame mask for speech signals.
4.1 Nonstationary Gabor Transform with Pitch-dependent Frequency Resolution
We consider analyzing a voiced signal with a window $g$ that is symmetric around zero. As in the stationary case in Section 2.2, we use a constant time hopsize $a$, resulting in $N = L/a$ sampling points in time for the TF analysis. However, we set the frequency hopsize according to the fundamental frequency of the signal (see Remark 2, item 1, for discussion of the pitch estimation issue). Following the quasi-stationary assumption for speech signals, we assume that the fundamental frequency is approximately fixed within the interval of the analysis window. At time index $n$, let $f_0(n)$ denote the fundamental frequency in Hz; we set the corresponding frequency hopsize $b_n$ as
$$b_n = \operatorname{round}\!\left(\frac{p}{q}\, f_0(n)\, \frac{L}{f_s}\right), \quad (8)$$
where $(p, q)$ is a pair of parameters to be set, $\operatorname{round}(\cdot)$ denotes rounding to the closest positive integer, and $f_s$ is the signal's sampling rate in Hz. With (8), $q$ frequency sampling points are deployed per $p f_0(n)$ Hz. The total number of frequency sampling points at time index $n$ is hence $M_n = \lceil L / b_n \rceil$. Consequently, we obtain the pitch-dependent nonstationary Gabor system (NSGS) as
$$\mathcal{G} = \{g_{m,n} = \mathbf{M}_{m b_n}\mathbf{T}_{na}\, g\}_{m=0,\dots,M_n-1,\; n=0,\dots,N-1}. \quad (9)$$
It is called a nonstationary Gabor frame (NSGF) if it fulfills (1) for all $f \in \mathbb{C}^L$. The sequence $\{c_{m,n} = \langle f, g_{m,n}\rangle\}$ are the nonstationary Gabor transform coefficients. In general, due to the dynamic frequency hopsize, these coefficients do not form a matrix.
Eq. (8) features a time-varying and pitch-dependent frequency resolution. More importantly, it allows harmonic alignment in the NSGT coefficients with respect to the frequency index $m$. For example, with $p = q = 1$, for any $n$, the frequencies $m\, b_n\, f_s / L$ naturally correspond to the harmonic frequencies of the signal. The parameter $p$ allows performing partial alignment wrt. integer multiples of the $p$-th harmonic frequency.
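As a sketch, the per-frame hopsizes can be computed from an $f_0$ track as follows. The parameter names `p` and `q` follow our notation for (8), and the clamping to a positive integer is our own safeguard for unvoiced frames.

```python
import numpy as np

def freq_hopsizes(f0_track, L, fs, p=1, q=1):
    """Pitch-dependent frequency hopsizes, one per analysis frame.
    Implements b_n = round((p/q) * f0(n) * L / fs), cf. Eq. (8):
    q frequency bins are deployed per p*f0(n) Hz."""
    f0 = np.asarray(f0_track, dtype=float)
    b = np.rint((p / q) * f0 * L / fs).astype(int)
    return np.maximum(b, 1)   # a hopsize must be a positive integer

# With p = q = 1, bin m of frame n sits at frequency m * b[n] * fs / L,
# i.e., approximately at the m-th harmonic m * f0(n).
```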
Remark 1. To satisfy $N = L/a \in \mathbb{N}$, zero-padding of the signal may be needed for an appropriate $L$. If an extremely large $L$ is required, it is always practicable to divide the signal into segments of shorter duration using overlap-and-add windows, and obtain NSGT coefficients for each segment separately. A practical example of such a procedure can be found in [14].

Now we consider the canonical dual of $\mathcal{G}$. Denote by $\operatorname{supp}(g)$ the support of the window $g$, i.e., the interval where the window is nonzero. We choose $b_n$ such that $M_n \ge |\operatorname{supp}(g)|$, which is referred to as the painless case [13]. In other words, we require the frequency sampling points to be dense enough. In this painless case, we have the following [13].
Proposition 1. If $\mathcal{G}$ is a painless-case NSGF, then the frame operator $\mathbf{S}$ (cf. (4)) is an $L \times L$ diagonal matrix with diagonal elements
$$\mathbf{S}[l, l] = \sum_{n=0}^{N-1} M_n\, |g[l - na]|^2, \qquad l = 0, \dots, L-1. \quad (10)$$
The canonical dual frame is given by
$$\tilde{g}_{m,n}[l] = \frac{g[l - na]}{\mathbf{S}[l, l]}\, e^{2\pi i m b_n l / L}. \quad (11)$$
4.2 Harmonic-aligned Frame Mask
In this section, we present a general form of frame mask based on the above pitch-dependent NSGT. For two voiced signals $f_1, f_2$, denote their fundamental frequencies by $f_0^{(1)}(n)$ and $f_0^{(2)}(n)$, respectively. Using (9) with the same window $g$ and the same time hopsize $a$ for both signals, we construct two Gabor systems $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$. Denote $M = \max_n \{M_n^{(1)}, M_n^{(2)}\}$. To simplify the presentation of the concept without losing the frame property (1), we can consider extending the two systems to $M$ modulations per time position, e.g., with periodic extension of the modulation operator wrt. the index $m$. Under such circumstances, we can denote the NSGT coefficients in matrix form as $\mathbf{C}^{(1)}$ and $\mathbf{C}^{(2)}$. The harmonic-aligned frame mask (HAFM) between the two voiced signals therefore acts as
$$f_2 \approx \mathbf{M}_{\mathbf{m}}\, f_1 = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} m_{m,n}\, \langle f_1, g^{(1)}_{m,n}\rangle\, \tilde{g}^{(2)}_{m,n}. \quad (12)$$
To estimate the frame mask, existing methods [11, 12] for the problem in (6) can be directly applied. For both Gabor systems $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$, the parameters $p$ and $q$ in (8) need to be appropriately set. We set $q$ to the same value for both systems. However, depending on the specifics of the source and target signals (as well as the application purpose), the parameter $p$ may be set to different values for the two systems. Example 1: if $f_0^{(1)}$ and $f_0^{(2)}$ are close (enough), we consider $p = 1$ for both Gabor systems. This leads to a one-to-one alignment of all harmonics. Example 2: if $f_0^{(1)}$ and $f_0^{(2)}$ are significantly different in value, we may consider an anchor frequency $f_a$ and set $p = \operatorname{round}(f_a / f_0)$ for each system. This results in partial alignment of the harmonics, i.e., only the harmonics around $f_a$ and its multiples are aligned.
Remark 2.
1) The proposed approach depends in practice on a reliable method to estimate the fundamental frequencies. A thorough discussion of this topic is beyond the scope of this paper; in the evaluation, we applied the methods in [20, 21]. 2) One may have the false impression that pitch independence is achieved in the frame masks by the harmonic alignment. On the contrary, the resulting frame mask is essentially dependent on the fundamental frequencies. It equivalently describes the variations between two spectra that are warped in a pitch-dependent and linear way. It contains information related to the spectral envelopes and also depends strongly on the fundamental frequencies. It is our interest to utilize the proposed mask as a feature measure for classification tasks.
5 Evaluation in Content-dependent Speaker Comparison
We now evaluate harmonic-aligned frame masks for speaker identity comparison in a content-dependent context. In particular, the source and target signals are of the same vowel but pronounced by different speakers. In this setting, we estimate the frame masks between an input speaker and a fixed reference speaker. Different speakers are thus compared to the same reference speaker, and the estimated masks are used as speaker features to measure and distinguish the speaker identities. This can be considered a task of closed-set speaker identification with content-dependent and limited-data constraints (see the experimental settings in 5.1).
To estimate the harmonic-aligned frame mask, we adopt the approach (7) and use a transform-domain proxy [11]. In our case, the first term in (7) can be written as $\|\mathbf{m} \odot \mathbf{C}^{(1)} - \mathbf{C}^{(2)}\|_2^2$ under a diagonal approximation of the Gram matrix of the NSGF, i.e., $\langle g_{m,n}, g_{m',n'}\rangle \approx 0$ if $(m,n) \ne (m',n')$. We estimate the mask via
$$\hat{\mathbf{m}} = \arg\min_{\mathbf{m}} \|\mathbf{m} \odot \mathbf{C}^{(1)} - \mathbf{C}^{(2)}\|_2^2 + \lambda\, d(\mathbf{m}), \quad (13)$$
where $\odot$ denotes the entrywise product. In this evaluation, we use the following regularization term:
$$d(\mathbf{m}) = \|\mathbf{m} - \mathbf{m}_0\|_2^2, \quad (14)$$

where $\mathbf{m}_0$ is a reference mask.
With (14), the objective function in (13) is a quadratic form in $\mathbf{m}$, which leads to the following explicit solution:
$$\hat{m}_{m,n} = \frac{\overline{c^{(1)}_{m,n}}\, c^{(2)}_{m,n} + \lambda\, m_{0,m,n}}{|c^{(1)}_{m,n}|^2 + \lambda}. \quad (15)$$
Here $\overline{(\cdot)}$ denotes the complex conjugate.
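Since (15) applies entrywise, the estimator is a one-liner. The sketch below uses our own function name and treats $\mathbf{C}^{(1)}$, $\mathbf{C}^{(2)}$ and $\mathbf{m}_0$ as NumPy arrays of equal shape.

```python
import numpy as np

def estimate_hafm(C1, C2, m0, lam):
    """Entrywise Tikhonov solution of (13) with penalty (14):
    m = (conj(C1) * C2 + lam * m0) / (|C1|^2 + lam)  -- cf. Eq. (15)."""
    return (np.conj(C1) * C2 + lam * m0) / (np.abs(C1) ** 2 + lam)
```

For small $\lambda$ the estimate approaches the coefficient ratio where $|c^{(1)}|$ is large; for large $\lambda$ it is pulled toward the reference mask $\mathbf{m}_0$.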
5.1 Experimental Settings
For experimental evaluation, we extracted two sets of English vowels, /iy/ and /u/ (we use these phonetic symbols as in the database's documentation), from the TIMIT database [22]. The vowels were from speakers. For each speaker, there were samples of /iy/ as well as samples of /u/ included. The signals were downsampled at Hz. The fundamental frequency was obtained with the methods proposed in [20, 21] and assumed known throughout the evaluation.
We chose from the speakers a reference speaker whose fundamental frequency was about the average of all speakers'. For the NSGT, we used a Hann window with a support interval of ms length. The time hopsize was set to ms. For the pitch-dependent frequency hopsize, i.e., (8), we set $q$ according to pilot tests. For $p$, we used an average value of the first formant frequency ($F_1$) as anchor frequency and the average fundamental frequency of a speaker as reference, and fixed $p$ for the speaker. We used Hz and Hz for /iy/ and /u/, respectively [23]. For (15), we empirically set $\mathbf{m}_0$ (all-ones) and $\lambda$. Part of the routines in the LTFAT toolbox [1, 24] were used to implement the NSGT.
For each vowel type, the frame masks for an input speaker were computed from pairs of signals (as there were also samples from the reference speaker). To obtain a variety of masks, for each signal pair we computed the frame masks as illustrated in Fig. 1. Hence, $\mathbf{C}^{(1)}$ and $\mathbf{C}^{(2)}$ in (15) were used column-wise for the feature extraction. The obtained mask vectors were used as speaker feature vectors. We employed a fully connected deep neural network (DNN) for the evaluation. The feature vectors were divided in the following way for training and testing. For each speaker, of the speaker's masks were randomly selected as training data, and the rest were used for testing. The DNN structure was set as . For DNN training, the following settings were used [25, 26, 27]. The number of epochs for RBM pretraining was , with the learning rate set as . The number of epochs for DNN fine-tuning was , where in the first epochs only the parameters of the output layer were adjusted. The minibatch size was set to .

5.2 Results
Fig. 2 shows the performance of the harmonic-aligned frame mask (HAFM) in the vowel-dependent speaker classification tasks. For comparison, mel-frequency cepstral coefficients (MFCC) [28] and the NSGT coefficients (CNSGT) were also evaluated in the same way. We also tested the condition in which $f_0$ was included as an extra feature dimension. It can be seen from the results that CNSGT mostly performed the worst. On the other hand, HAFM, which is built on CNSGT, outperforms the others with noticeably higher accuracy. This implies that with the comparison-based feature extraction, the HAFM feature is more effective at capturing and representing the speaker variations. The accuracy of HAFM is for the "DNN/iy/+DNN/u/" case (i.e., DNNs of both vowels were combined for decision). It can also be noticed that including $f_0$ as an extra feature seems beneficial for MFCC. However, such a benefit is generally not observed for CNSGT and HAFM, as the related information has already been well incorporated in these features.
In the evaluation, it was also observed that the frame-mask-based DNNs performed extremely well in distinguishing the reference speaker from the rest of the speakers. As the frame mask features were obtained by exhaustive comparison to the reference speaker, the resulting DNNs were inherently good verification models for the reference speaker. One of our future directions is to combine the verification models of all enrolled speakers to construct a more comprehensive system.
6 Conclusions
The frame mask approach has been extended from instrumental sound analysis to voiced speech analysis. We have addressed the issue of misaligned harmonics by developing a nonstationary Gabor transform (NSGT) with pitch-dependent, time-varying frequency resolution. The transform allows effective harmonic alignment in the transform domain. On this basis, the harmonic-aligned frame mask has been proposed for voiced speech signals. We have applied the proposed frame mask as a similarity measure to compare and distinguish speaker identities, and have evaluated the proposal in a vowel-dependent and limited-data setting. Results confirm that the proposed frame mask is feasible for speech applications. It is effective in representing speaker characteristics in the content-dependent context and shows potential for speaker-identity-related applications, especially in limited-data scenarios.
References
 [1] P. Søndergaard, B. Torrésani, and P. Balazs, “The linear time frequency analysis toolbox,” International Journal of Wavelets, Multiresolution and Information Processing, vol. 10, no. 4, p. 1250032, 2012. [Online]. Available: http://ltfat.github.io/
 [2] D. Gabor, “Theory of communication,” J. IEE  Part I: General, vol. 94, no. 73, pp. 429–457, January 1947.
 [3] S. Mallat, A Wavelet Tour of Signal Processing  The Sparse Way, 3rd ed. Academic Press, 2009.
 [4] K. Gröchenig, Foundations of Time-Frequency Analysis. Birkhäuser, Boston, MA, USA, 2001.
 [5] H. G. Feichtinger and K. Nowak, A first survey of Gabor multipliers, 2003, ch. 5, pp. 99–128.
 [6] D. T. Stoeva and P. Balazs, “Invertibility of multipliers,” Applied and Computational Harmonic Analysis, vol. 33, no. 2, pp. 292–299, 2012.
 [7] P. Balazs and D. T. Stoeva, “Representation of the inverse of a multiplier,” Journal of Mathematical Analysis and Applications, vol. 422, pp. 981–994, 2015.
 [8] F. Hlawatsch, G. Matz, H. Kirchauer, and W. Kozek, “Time-frequency formulation, design, and implementation of time-varying optimal filters for signal estimation,” IEEE Transactions on Signal Processing, vol. 48, no. 5, pp. 1417–1432, May 2000.
 [9] P. Balazs, B. Laback, G. Eckel, and W. A. Deutsch, “Time-frequency sparsity by removing perceptually irrelevant components using a simple model of simultaneous masking,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 1, pp. 34–49, 2010.
 [10] P. Majdak, P. Balazs, W. Kreuzer, and M. Dörfler, “A time-frequency method for increasing the signal-to-noise ratio in system identification with exponential sweeps,” in Proc. 36th International Conference on Acoustics, Speech and Signal Processing, ICASSP 2011, Prague, 2011.
 [11] P. Depalle, R. Kronland-Martinet, and B. Torrésani, “Time-frequency multipliers for sound synthesis,” in Proc. SPIE, Wavelets XII, 2007, pp. 221–224.
 [12] A. Olivero, B. Torrésani, and R. Kronland-Martinet, “A class of algorithms for time-frequency multiplier estimation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 8, pp. 1550–1559, Aug 2013.
 [13] P. Balazs, M. Dörfler, F. Jaillet, N. Holighaus, and G. Velasco, “Theory, implementation and applications of nonstationary Gabor frames,” Journal of Computational and Applied Mathematics, vol. 236, no. 6, pp. 1481–1496, 2011.
 [14] N. Holighaus, M. Dörfler, G. A. Velasco, and T. Grill, “A framework for invertible, real-time constant-Q transforms,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 775–785, April 2013.
 [15] E. S. Ottosen and M. Dörfler, “A phase vocoder based on nonstationary Gabor frames,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2199–2208, Nov 2017.
 [16] P. Casazza, “The art of frame theory,” Taiwanese J. Math., vol. 4, no. 2, pp. 129–202, 2000.
 [17] O. Christensen, An Introduction to Frames and Riesz Bases. Birkhäuser Boston, 2003.
 [18] H. G. Feichtinger and T. Strohmer, Gabor Analysis and Algorithms: Theory and Applications. Birkhäuser Boston, 1998.
 [19] P. Balazs, “Basic definition and properties of Bessel multipliers,” Journal of Mathematical Analysis and Applications, vol. 325, no. 1, pp. 571–585, 2007.
 [20] F. Huang and T. Lee, “Pitch estimation in noisy speech using accumulated peak spectrum and sparse estimation technique,” IEEE Trans. Audio, Speech and Lang. Proc., vol. 21, no. 1, pp. 99–109, Jan. 2013.

 [21] F. Huang and P. Balazs, “Dictionary learning for pitch estimation in speech signals,” in Proc. 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 2017, pp. 1–6.
 [22] “DARPA TIMIT acoustic phonetic continuous speech corpus CDROM,” 1993. [Online]. Available: http://www.ldc.upenn.edu/Catalog/LDC93S1.html
 [23] P. Ladefoged and K. Johnson, A course in phonetics, 6th ed. Boston, MA: Wadsworth, Cengage Learning, 2011.
 [24] Z. Průša, P. L. Søndergaard, N. Holighaus, C. Wiesmeyr, and P. Balazs, “The Large TimeFrequency Analysis Toolbox 2.0,” in Sound, Music, and Motion, ser. Lecture Notes in Computer Science, M. Aramaki, O. Derrien, R. KronlandMartinet, and S. Ystad, Eds. Springer International Publishing, 2014, pp. 419–442.
 [25] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.

 [26] G. Hinton, “A Practical Guide to Training Restricted Boltzmann Machines,” Tech. Rep., 2010. [Online]. Available: http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
 [27] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, Jan 2014.
 [28] Z. Fang, Z. Guoliang, and S. Zhanjiang, “Comparison of different implementations of MFCC,” J. Comput. Sci. Technol., vol. 16, no. 6, pp. 582–589, Nov. 2001.