Time-frequency (TF) analysis is a foundation of audio and speech signal processing. The short-time Fourier transform (STFT) is a widely used tool that can be implemented efficiently with the FFT. The STFT offers a straightforward interpretation of a signal, providing uniform time and frequency resolution with linearly spaced TF bins. The corresponding theory was generalized in the framework of Gabor analysis and Gabor frames [2, 3, 4].
Signal synthesis is an important application area of time-frequency transforms. Signal modification, denoising, separation, and so on can be achieved by manipulating the analysis coefficients before synthesizing the desired signal. The theory of the Gabor multiplier or, in general terms, the frame multiplier [6, 7] provides a basis for the stability and invertibility of such operations. A frame multiplier is an operator that converts one signal into another by pointwise multiplication in the transform domain prior to resynthesis. The sequence of multiplication coefficients is called a frame mask (or symbol). Such operators allow easy implementation of time-varying filters. They have been used in perceptual sparsity, denoising and signal synthesis. Algorithms to estimate the frame mask between audio signals were investigated in [11, 12], where it was demonstrated that the frame mask between two instrumental sounds (of the same note) is an effective measure of the timbre variations between the instruments. Such masks were used for timbre morphing and instrument categorization. In this paradigm, the two signals share the same fundamental frequency and their harmonics are naturally aligned, which makes the obtained mask meaningful for TF analysis/synthesis with uniform resolution.
This study extends the frame mask method to speech signals. One intrinsic property of (voiced) speech is that the fundamental frequency (or pitch) varies continuously over time. Therefore, the harmonic structures are not well aligned when comparing two signals. We propose to employ the non-stationary Gabor transform (NSGT) to tackle this issue. NSGT provides flexible time-frequency resolution by incorporating dynamic time/frequency hop-sizes and dynamic analysis windows [13, 14, 15]. We develop an NSGT whose frequency resolution changes over time: we set the frequency hop-size in proportion to the fundamental frequency to achieve harmonic alignment (or partial alignment, cf. Section 4) in the transform domain. On this basis, we propose the harmonic-aligned frame mask. To demonstrate feasibility in speech, we evaluate the proposal in the context of vowel-dependent speaker comparison. Frame masks between voiced signals of the same vowel but pronounced by different speakers are proposed as similarity measures for speaker characteristics to distinguish speaker identities in a limited-data scenario (cf. Section 5 for details).
This paper is organized as follows. In Section 2, we briefly review frame and Gabor theory. In Section 3, we elaborate on the frame mask and its previous application in instrumental sound analysis. In Section 4, we develop the non-stationary Gabor transform with pitch-dependent frequency resolution and propose the harmonic-aligned frame mask. Section 5 presents the evaluation in vowel-dependent speaker identification. Finally, Section 6 concludes this study.
2 Preliminaries and Notation
2.1 Frame Theory
Denote by $\{g_k\}_{k\in K}$ a sequence of signal atoms in the Hilbert space $\mathcal{H}$, where $K$ is an index set. This atom sequence is a frame if and only if there exist constants $A$ and $B$, $0 < A \le B < \infty$, such that
$$A\,\|f\|^2 \;\le\; \sum_{k\in K} |\langle f, g_k\rangle|^2 \;\le\; B\,\|f\|^2, \qquad \forall f \in \mathcal{H}, \qquad (1)$$
where $\{\langle f, g_k\rangle\}_{k\in K}$ are the analysis coefficients. $A$ and $B$ are called the lower and upper frame bounds, respectively. The frame operator is defined by $\mathbf{S}f = \sum_{k\in K}\langle f, g_k\rangle\, g_k$.
Given the canonical dual frame $\{\tilde{g}_k\}_{k\in K} = \{\mathbf{S}^{-1}g_k\}_{k\in K}$ of $\{g_k\}_{k\in K}$, any $f \in \mathcal{H}$ can be perfectly reconstructed from the analysis coefficients by
$$f = \sum_{k\in K} \langle f, g_k\rangle\, \tilde{g}_k.$$
The canonical dual frame always exists, and in redundant cases there are infinitely many other duals allowing reconstruction.
2.2 Discrete Gabor Transform
We take the Hilbert space to be $\mathbb{C}^L$. Given a non-zero prototype window $g \in \mathbb{C}^L$, the translation operator $\mathbf{T}_u$ and the modulation operator $\mathbf{M}_\eta$ are, respectively, defined as
$$\mathbf{T}_u g[l] = g[l-u], \qquad \mathbf{M}_\eta g[l] = g[l]\, e^{2\pi i \eta l / L},$$
where $l, u, \eta \in \{0, \ldots, L-1\}$ and the translation is performed modulo $L$. For selected constants $a, b$ with some $M, N \in \mathbb{N}$ such that $L = aN = bM$, we take the sampling points to form a regular discrete lattice, i.e., $\{(na, mb)\}_{n=0,\ldots,N-1,\,m=0,\ldots,M-1}$, and obtain the Gabor system as
$$\mathcal{G}(g,a,b) = \{\mathbf{M}_{mb}\mathbf{T}_{na}\, g\}_{n=0,\ldots,N-1,\; m=0,\ldots,M-1}.$$
The canonical dual frame of the Gabor frame is again a Gabor frame with the dual window $\tilde{g} = \mathbf{S}^{-1}g$, with which $f$ can be perfectly reconstructed by
$$f = \sum_{n=0}^{N-1}\sum_{m=0}^{M-1} \langle f, \mathbf{M}_{mb}\mathbf{T}_{na}\, g\rangle\; \mathbf{M}_{mb}\mathbf{T}_{na}\, \tilde{g}.$$
Note that the DGT coefficients are essentially sampling points of the STFT of $f$ with window $g$ at the time-frequency points $(na, mb)$, with $a$ and $b$ being the sampling steps (i.e., hop-sizes) in time and frequency, respectively. In non-stationary settings, the hop-sizes are allowed to vary (cf. Section 4).
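To make the sampled-STFT view concrete, the following is a minimal NumPy sketch of computing DGT coefficients frame by frame with the FFT. The window length, hop-size, and channel count are illustrative choices, not values from the text:

```python
import numpy as np

def dgt(x, g, a, M):
    """Discrete Gabor transform of x: sampled STFT with window g,
    time hop-size a and M frequency channels."""
    L = len(x)
    N = L // a
    C = np.empty((M, N), dtype=complex)
    half = len(g) // 2
    for n in range(N):
        # windowed frame centred at time n*a (indices taken modulo L)
        idx = (n * a + np.arange(-half, len(g) - half)) % L
        # zero-pad to M samples; the FFT yields one coefficient column
        C[:, n] = np.fft.fft(x[idx] * g, n=M)
    return C

# toy usage: 8 cycles of a cosine, 32-sample Hann window, hop a=16, M=64 channels
x = np.cos(2 * np.pi * 8 * np.arange(256) / 256)
C = dgt(x, np.hanning(32), a=16, M=64)
```

Each column of `C` holds the spectrum of one windowed frame; the rows index the linearly spaced frequency bins $mb$.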
3 Frame mask for instrumental sound analysis
3.1 Frame Mask
Consider a pair of frames $\{g_k\}_{k\in K}$ and $\{h_k\}_{k\in K}$. A frame multiplier with symbol $\sigma = \{\sigma_k\}_{k\in K}$, denoted by $\mathbf{M}_{\sigma}$, is an operator that converts a signal into another by pointwise multiplication in the transform domain; the symbol is the sequence of multiplication coefficients. For a signal $f$,
$$\mathbf{M}_{\sigma} f = \sum_{k\in K} \sigma_k\, \langle f, g_k\rangle\, h_k. \qquad (5)$$
Here $\sigma$ is called a frame mask. In the considered analysis/transform domain, $\sigma$ can be viewed as a transfer function.
When two Gabor frames $\mathcal{G}(g,a,b)$ and $\mathcal{G}(h,a,b)$ are considered, we set $k=(n,m)$. In this case the frame multiplier in (5) is known as a Gabor multiplier. The corresponding frame mask is also known as a Gabor mask.
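As an illustration of such an operator, the sketch below uses SciPy's STFT/ISTFT pair as a stand-in for a Gabor analysis/synthesis frame pair; the test tone, the lowpass mask, and all parameters are invented for the example (a time-varying mask would simply be an arbitrary array of the same shape as the coefficients):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)          # toy input: a 440 Hz tone

# analysis: STFT coefficients stand in for the Gabor analysis coefficients
f, _, X = stft(x, fs=fs, nperseg=256)

# symbol / frame mask: a simple lowpass mask keeping frequencies below 1 kHz
mask = (f < 1000.0).astype(float)[:, None]

# pointwise multiplication in the transform domain, then resynthesis
_, y = istft(mask * X, fs=fs, nperseg=256)
```

Since the 440 Hz tone lies well inside the passband, the resynthesized `y` is nearly identical to `x`; masking out bins above 1 kHz realizes a crude time-invariant filter, and replacing `mask` by a coefficient-dependent array realizes a time-varying one.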
3.2 For Instrument Timbre Analysis and Conversion
Consider the relation $y \approx \mathbf{M}_{\sigma} x$, where $x$ and $y$ are two audio signals and $\sigma$ is the unknown mask to be estimated. An obvious solution is to take the entrywise quotient of the analysis coefficients, i.e., $\sigma_k = c_k^y / c_k^x$, where $c^x$ and $c^y$ are the DGT coefficients of $x$ and $y$, respectively. However, this solution is unstable and unbounded, as the DGT coefficients in the denominator can be zero or very small. To guarantee the existence of a stable solution, it was proposed to estimate the mask via
with a (convex) regularization term $R(\sigma)$ whose influence is controlled by the parameter $\lambda$. As the existence of a stable solution is assured, such an approach can in general be applied to an arbitrary pair of signals. However, it might be difficult to interpret the estimated masks (e.g., the mask between two pure-tone signals with different fundamental frequencies).
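With the simplest Tikhonov choice $R(\sigma) = \|\sigma\|^2$ (an assumption here; [11, 12] study a broader class of regularizers), the minimization decouples entrywise and has a closed-form solution, sketched below on synthetic coefficients:

```python
import numpy as np

def estimate_mask(X, Y, lam=1e-3):
    """Minimise ||sigma * X - Y||^2 + lam * ||sigma||^2 entrywise over sigma.
    Closed-form solution: sigma = conj(X) * Y / (|X|^2 + lam)."""
    return np.conj(X) * Y / (np.abs(X) ** 2 + lam)

# synthetic test: target coefficients Y are X scaled by a known mask h
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 32)) + 1j * rng.standard_normal((64, 32))
h = np.linspace(1.0, 0.1, 64)[:, None]     # known "true" transfer function
sigma = estimate_mask(X, h * X, lam=1e-8)  # recovers h up to regularization bias
```

The regularizer keeps the denominator bounded away from zero, so the estimate stays stable even where $|X|$ is small, at the price of a small bias toward zero in those entries.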
Given that $x$ and $y$ are of the same note produced by different instruments, the frame mask between the two signals was found to effectively characterize the timbre difference between the two instruments [11, 12]. Such masks were utilized as similarity measures for instrument classification and for timbre morphing and conversion. The rationale of these applications rests on two aspects:
Instrumental signals of the same note possess the same fundamental frequency; the harmonic structures of the signals are therefore naturally aligned.
DGT performs TF analysis over a regular TF lattice, and consequently preserves the property of harmonic alignment in the transform domain.
4 Frame mask for speech signals using Non-stationary Gabor transform
Similar to audio recordings of instrument notes, (voiced) speech signals are harmonic signals. Analogously to the above-mentioned applications, this study explores the application of frame masks to speech signals. In particular, we consider using voiced speech as source and target signals and estimating the frame mask between them. We are especially interested in the case where the source and the target have the same content, e.g., the same vowel. In such a case, a valid frame mask could measure specific variations among the signals, such as speaker variations.
Nevertheless, when attempting to use (7) for speech signals, we immediately face a fundamental problem: the fundamental frequency of speech usually varies continuously over time, so the harmonic structures of the source and target voices are mostly not aligned. To address this problem, we propose to employ the non-stationary Gabor transform, which allows flexible time-frequency resolution. Within the framework of non-stationary Gabor analysis, we intend to achieve dynamic alignment of the signals' harmonic structures. In the following, we develop an NSGT with pitch-dependent frequency resolution to achieve harmonic alignment in the transform domain, and propose the harmonic-aligned frame mask for speech signals on that basis.
4.1 Non-stationary Gabor Transform with Pitch-dependent Frequency Resolution
We consider analyzing a voiced signal with a window that is symmetric around zero. As in the stationary case in Section 2.2, we use a constant time hop-size $a$, resulting in $N$ sampling points in time for the TF analysis. However, we set the frequency hop-size according to the fundamental frequency of the signal (see Remark 2.1 for a discussion of the pitch estimation issue). Following the quasi-stationary assumption for speech signals, we assume that the fundamental frequency is approximately fixed within the interval of the analysis window. At time index $n$, let $f_0(n)$ denote the fundamental frequency in Hz; we set the corresponding frequency hop-size as
where the two parameters are chosen by the user, $[\cdot]$ denotes rounding to the closest positive integer, and $f_s$ is the signal's sampling rate in Hz. With (8), a fixed number of frequency sampling points is deployed per $f_0(n)$ Hz. The total number of frequency sampling points at time index $n$ is hence denoted by $M_n$. Consequently, with $b_n$ the resulting frequency hop-size, we obtain the pitch-dependent non-stationary Gabor system (NSGS) as
$$\{\mathbf{M}_{m b_n}\mathbf{T}_{na}\, g\}_{n=0,\ldots,N-1,\; m=0,\ldots,M_n-1}. \qquad (9)$$
It is called a non-stationary Gabor frame (NSGF) if it fulfills (1) for all $f \in \mathbb{C}^L$. The resulting coefficient sequence constitutes the non-stationary Gabor transform coefficients. In general, due to the dynamic frequency hop-size, these coefficients do not form a matrix.
Eq. (8) features a time-varying, pitch-dependent frequency resolution. More importantly, it allows harmonic alignment of the NSGT coefficients with respect to the frequency index. For example, with the unit parameter setting, the frequency sampling points at any time index naturally correspond to the harmonic frequencies of the signal. The parameters also allow performing partial alignment with respect to integer multiples of a selected harmonic frequency.
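Since the exact constants of (8) are parameterized, one plausible realization of the per-frame hop-size selection can be sketched as follows; placing `Q` frequency sampling points per harmonic interval (i.e., a bin spacing of about $f_0/Q$ Hz) is an assumption for this sketch, as are the values of `fs`, `L`, `Q`, and the toy pitch track:

```python
import numpy as np

def freq_hops(f0_track, fs, L, Q=1):
    """Per-frame frequency hop-sizes (in bins of an L-point transform),
    giving a bin spacing of roughly f0/Q Hz per frame -- one plausible
    realisation of a pitch-dependent hop-size rule."""
    f0_track = np.asarray(f0_track, dtype=float)
    b = np.rint(f0_track * L / (Q * fs)).astype(int)  # bin spacing ~ f0/Q Hz
    b = np.maximum(b, 1)                              # keep hop-sizes positive
    M = L // b                                        # channels per frame
    return b, M

# toy pitch track: f0 gliding from 120 Hz to 180 Hz over 10 analysis frames
b, M = freq_hops(np.linspace(120, 180, 10), fs=16000, L=16000, Q=4)
```

As the pitch rises, the hop-size `b` grows and the channel count `M` shrinks, so the frequency index of a given harmonic stays (approximately) fixed across frames, which is exactly the alignment property exploited by the mask.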
Note that zero-padding of the signal may be needed to obtain an appropriate transform length. If an extremely large length is required, it is always practicable to divide the signal into segments of shorter duration using overlap-and-add windows and to obtain NSGT coefficients for each segment separately. A practical example of such a procedure can be found in the literature.
Now we consider the canonical dual frame. Denote by $\mathrm{supp}(g)$ the support of the window $g$, i.e., the interval where the window is nonzero. We choose $M_n \ge |\mathrm{supp}(g)|$, which is referred to as the painless case. In other words, we require the frequency sampling points to be dense enough. In this painless case, we have the following result.
If (9) is a painless-case NSGF, then the frame operator (cf. (4)) is a diagonal matrix with diagonal elements
$$d[l] = \sum_{n} M_n\, |g[l - na]|^2,$$
and the canonical dual frame is obtained by applying the inverse of this diagonal operator to the frame elements, i.e., with dual windows $\tilde{g}_n[l] = g[l - na]\,/\,d[l]$.
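The painless-case construction can be verified numerically. The NumPy sketch below does so for the stationary special case (constant channel count), with illustrative parameters and a frame-local FFT phase convention: it accumulates the diagonal of the frame operator, divides the window by it to get the dual, and checks perfect reconstruction:

```python
import numpy as np

L, a, M = 256, 32, 64          # signal length, time hop-size, number of channels
g = np.hanning(M)               # analysis window; supp(g) <= M (painless condition)
N = L // a                      # number of time positions

# diagonal of the frame operator: d[l] = M * sum_n |g[l - n*a]|^2
d = np.zeros(L)
for n in range(N):
    d[(n * a + np.arange(M)) % L] += g ** 2
d *= M

gd = g / d[:M]                  # canonical dual window (d is a-periodic here)

def dgt(x, win):
    """Analysis: FFT of each windowed frame gives one coefficient column."""
    C = np.empty((M, N), dtype=complex)
    for n in range(N):
        C[:, n] = np.fft.fft(x[(n * a + np.arange(M)) % L] * win)
    return C

def idgt(C, win):
    """Synthesis: overlap-add of inverse FFTs weighted by the (dual) window."""
    x = np.zeros(L, dtype=complex)
    for n in range(N):
        x[(n * a + np.arange(M)) % L] += M * np.fft.ifft(C[:, n]) * win
    return x

x = np.random.default_rng(1).standard_normal(L)
x_rec = idgt(dgt(x, g), gd).real   # analyse with g, synthesise with the dual
```

Because the frame operator is diagonal, inverting it costs only an entrywise division, which is what makes the painless case attractive in practice; `x_rec` matches `x` up to floating-point error.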
4.2 Harmonic-aligned Frame Mask
In this section, we present a general form of the frame mask based on the above pitch-dependent NSGT. For two voiced signals, denote their fundamental frequencies by $f_0^x$ and $f_0^y$, respectively. Using (9) with the same window and the same time hop-size for both signals, we construct two non-stationary Gabor systems. To simplify the presentation of the concept without losing the frame property (1), we consider extending the two systems, e.g., with a periodic extension of the modulation operator with respect to the frequency index. Under these circumstances, we can denote the NSGT coefficients of the two signals in matrix form. The harmonic-aligned frame mask (HAFM) between the two voiced signals therefore acts as
To estimate the frame mask, existing methods [11, 12] for the problem in (6) can be directly applied. For both Gabor systems, the parameters in (8) need to be appropriately set. One parameter is set to the same value for both systems; however, depending on the specifics of the source and target signals (as well as on the application purpose), the other parameter may be set to different values for the two systems. Example 1: if the two fundamental frequencies are close enough, we use the same setting for both Gabor systems, which leads to a one-to-one alignment of all harmonics. Example 2: if the fundamental frequencies are significantly different, we may consider an anchor frequency and set the parameter accordingly for each system. This results in a partial alignment of the harmonics, i.e., only the harmonics around the anchor frequency and its multiples are aligned.
1) The proposed approach depends in practice on a reliable method to estimate the fundamental frequencies. A thorough discussion of this topic is beyond the scope of this paper; in the evaluation, we applied the methods in [20, 21]. 2) One may get the false impression that the harmonic alignment makes the frame mask pitch-independent. On the contrary, the resulting frame mask essentially depends on the fundamental frequencies: it describes the variations between two spectra that are warped in a pitch-dependent and linear way. It thus contains information related to the spectral envelopes while also depending strongly on the fundamental frequencies. It is our interest to utilize the proposed mask as a feature measure for classification tasks.
5 Evaluation in Content-dependent Speaker Comparison
We now evaluate harmonic-aligned frame masks for speaker identity comparison in a content-dependent context. In particular, the source and target signals are of the same vowel but pronounced by different speakers. In this setting, we estimate the frame masks between an input speaker and a fixed reference speaker. Different input speakers are thus compared to the same reference speaker, and the estimated masks are used as speaker features to measure and distinguish the speaker identities. This can be considered a task of closed-set speaker identification with content-dependent and limited-data constraints (see the experimental settings in Section 5.1).
To estimate the harmonic-aligned frame mask, we adopt the approach in (7) and use a transform-domain proxy. In our case, the first term in (7) can be written in the transform domain. With a diagonal approximation of the covariance matrix of the NSGF, i.e., neglecting the inner products between distinct frame atoms, we estimate the mask via
where $\odot$ denotes the entrywise product. In this evaluation, we use the following regularization term
Here $\overline{(\cdot)}$ denotes the complex conjugate.
5.1 Experimental Settings
For experimental evaluation, we extracted two sets of English vowels, /iy/ and /u/ (we use these phonetic symbols as in the database's documentation), from the TIMIT database. The vowels were extracted from multiple speakers; for each speaker, samples of /iy/ as well as samples of /u/ were included. The signals were down-sampled. The fundamental frequency was obtained with the methods proposed in [20, 21] and assumed known throughout the evaluation.
We chose from the speakers a reference speaker whose fundamental frequency was about the average over all speakers. For the NSGT, we used a Hann window of fixed support length and a fixed time hop-size. For the pitch-dependent frequency hop-size, i.e., (8), the parameters were set according to pilot tests. As the anchor frequency, we used an average value of the first formant frequency, computed per speaker and held fixed for that speaker; different values were used for /iy/ and /u/. For (15), we empirically set the weights to all-ones. Some routines from the LTFAT toolbox [1, 24] were used to implement the NSGT.
For each vowel type, the frame masks for an input speaker were computed from pairs of signals (there were also samples from the reference speaker). To obtain a variety of masks, for each signal pair we computed the frame masks as illustrated in Fig. 1. Hence, the coefficient matrices in (15) were processed column-wise for the feature extraction, and the obtained mask vectors were used as speaker feature vectors. We employed a fully connected deep neural network (DNN) for the evaluation. The feature vectors were divided as follows for training and testing: for each speaker, a portion of the speaker's masks was randomly selected as training data, and the rest were used for testing. For DNN training, the following settings were used [25, 26, 27]: RBM pre-training was run for a fixed number of epochs with a fixed learning rate; DNN fine-tuning was run for a fixed number of epochs, where in the first epochs only the parameters of the output layer were adjusted; and mini-batch training was used.
Fig. 2 shows the performance of the harmonic-aligned frame mask (HAFM) in the vowel-dependent speaker classification tasks. For comparison, mel-frequency cepstral coefficients (MFCC) and the NSGT coefficients (C-NSGT) were also evaluated in the same way. We also tested the condition in which the fundamental frequency was included as an extra feature dimension. It can be seen from the results that C-NSGT mostly performed the worst. On the other hand, HAFM, which is built upon C-NSGT, outperforms the others with noticeably higher accuracy. This implies that with the comparative manner of feature extraction, the HAFM feature is more effective in capturing and representing the speaker variations. The accuracy of HAFM for the "DNN/iy/+DNN/u/" case (i.e., with the DNNs of both vowels combined for decision) is reported in Fig. 2. It can also be noticed that including the fundamental frequency as an extra feature seems beneficial for MFCC. However, such benefit is generally not observed for C-NSGT or HAFM, as the related information has already been well incorporated in these features.
In the evaluation, it was also observed that the frame-mask-based DNNs performed extremely well in distinguishing the reference speaker from the rest of the speakers. As the frame mask features were obtained by exhaustive comparison to the reference speaker, the resulting DNNs were inherently good verification models for the reference speaker. One of our future directions is to combine the verification models of all enrolled speakers to construct a more comprehensive system.
6 Conclusion
The frame mask approach has been extended from instrumental sound analysis to voiced speech analysis. We have addressed the harmonic misalignment issue by developing a non-stationary Gabor transform (NSGT) with pitch-dependent, time-varying frequency resolution. The transform enables effective harmonic alignment in the transform domain. On this basis, the harmonic-aligned frame mask has been proposed for voiced speech signals. We have applied the proposed frame mask as a similarity measure to compare and distinguish speaker identities, and have evaluated the proposal in a vowel-dependent, limited-data setting. The results confirm that the proposed frame mask is feasible for speech applications. It is effective in representing speaker characteristics in the content-dependent context and shows potential for speaker-identity-related applications, especially in limited-data scenarios.
-  P. Søndergaard, B. Torrésani, and P. Balazs, “The linear time frequency analysis toolbox,” International Journal of Wavelets, Multiresolution and Information Processing, vol. 10, no. 4, p. 1250032, 2012. [Online]. Available: http://ltfat.github.io/
-  D. Gabor, “Theory of communication,” J. IEE - Part I: General, vol. 94, no. 73, pp. 429–457, January 1947.
-  S. Mallat, A Wavelet Tour of Signal Processing - The Sparse Way, 3rd ed. Academic Press, 2009.
-  K. Gröchenig, Foundations of Time-Frequency Analysis. Boston, MA, USA: Birkhäuser, 2001.
-  H. G. Feichtinger and K. Nowak, A first survey of Gabor multipliers, 2003, ch. 5, pp. 99–128.
-  D. T. Stoeva and P. Balazs, “Invertibility of multipliers,” Applied and Computational Harmonic Analysis, vol. 33, no. 2, pp. 292–299, 2012.
-  P. Balazs and D. T. Stoeva, “Representation of the inverse of a multiplier,” Journal of Mathematical Analysis and Applications, vol. 422, pp. 981–994, 2015.
-  F. Hlawatsch, G. Matz, H. Kirchauer, and W. Kozek, “Time-frequency formulation, design, and implementation of time-varying optimal filters for signal estimation,” IEEE Transactions on Signal Processing, vol. 48, no. 5, pp. 1417–1432, May 2000.
-  P. Balazs, B. Laback, G. Eckel, and W. A. Deutsch, “Time-frequency sparsity by removing perceptually irrelevant components using a simple model of simultaneous masking,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 1, pp. 34–49, 2010.
-  P. Majdak, P. Balazs, W. Kreuzer, and M. Dörfler, “A time-frequency method for increasing the signal-to-noise ratio in system identification with exponential sweeps,” in Proc. 36th International Conference on Acoustics, Speech and Signal Processing, ICASSP 2011, Prag, 2011.
-  P. Depalle, R. Kronland-Martinet, and B. Torrésani, “Time-frequency multipliers for sound synthesis,” in Proc. SPIE, Wavelets XII, 2007, pp. 221–224.
-  A. Olivero, B. Torresani, and R. Kronland-Martinet, “A class of algorithms for time-frequency multiplier estimation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 8, pp. 1550–1559, Aug 2013.
-  P. Balazs, M. Dörfler, F. Jaillet, N. Holighaus, and G. Velasco, “Theory, implementation and applications of nonstationary Gabor frames,” Journal of Computational and Applied Mathematics, vol. 236, no. 6, pp. 1481–1496, 2011.
-  N. Holighaus, M. Dörfler, G. A. Velasco, and T. Grill, “A framework for invertible, real-time constant-Q transforms,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 775–785, April 2013.
-  E. S. Ottosen and M. Dörfler, “A phase vocoder based on nonstationary Gabor frames,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2199–2208, Nov 2017.
-  P. Casazza, “The art of frame theory,” Taiwanese J. Math., vol. 4, no. 2, pp. 129–202, 2000.
-  O. Christensen, An Introduction to Frames and Riesz Bases. Birkhäuser Boston, 2003.
-  H. G. Feichtinger and T. Strohmer, Gabor Analysis and Algorithms - Theory and Applications. Birkhäuser Boston, 1998.
-  P. Balazs, “Basic definition and properties of Bessel multipliers,” Journal of Mathematical Analysis and Applications, vol. 325, no. 1, pp. 571–585, 2007.
-  F. Huang and T. Lee, “Pitch estimation in noisy speech using accumulated peak spectrum and sparse estimation technique,” IEEE Trans. Audio, Speech and Lang. Proc., vol. 21, no. 1, pp. 99–109, Jan. 2013.
-  F. Huang and P. Balazs, “Dictionary learning for pitch estimation in speech signals,” in Proc. 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 2017, pp. 1–6.
-  “DARPA TIMIT acoustic phonetic continuous speech corpus CDROM,” 1993. [Online]. Available: http://www.ldc.upenn.edu/Catalog/LDC93S1.html
-  P. Ladefoged and K. Johnson, A course in phonetics, 6th ed. Boston, MA: Wadsworth, Cengage Learning, 2011.
-  Z. Průša, P. L. Søndergaard, N. Holighaus, C. Wiesmeyr, and P. Balazs, “The Large Time-Frequency Analysis Toolbox 2.0,” in Sound, Music, and Motion, ser. Lecture Notes in Computer Science, M. Aramaki, O. Derrien, R. Kronland-Martinet, and S. Ystad, Eds. Springer International Publishing, 2014, pp. 419–442.
-  G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.
-  G. Hinton, “A Practical Guide to Training Restricted Boltzmann Machines,” Tech. Rep., 2010. [Online]. Available: http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
-  Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, Jan 2014.
-  Z. Fang, Z. Guoliang, and S. Zhanjiang, “Comparison of different implementations of MFCC,” J. Comput. Sci. Technol., vol. 16, no. 6, pp. 582–589, Nov. 2001.