Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emotion Recognition

08/07/2019 · Shruti Gupta, et al. · NIT Patna

Convolutional neural networks (CNN) are widely used for speech emotion recognition (SER). In such cases, the short-time Fourier transform (STFT) spectrogram is the most popular choice for representing speech and is fed as input to the CNN. However, the uncertainty principle of the short-time Fourier transform prevents it from capturing high time and frequency resolutions simultaneously. On the other hand, the recently proposed single frequency filtering (SFF) spectrogram promises to be a better alternative because it captures both time and frequency resolutions simultaneously. In this work, we explore the SFF spectrogram as an alternative representation of speech for SER. We have modified the SFF spectrogram by taking the average of the amplitudes of all the samples between two successive glottal closure instant (GCI) locations. The duration between two successive GCI locations gives the pitch period, motivating us to name the modified SFF spectrogram the pitch-synchronous SFF spectrogram. The GCI locations were detected using the zero frequency filtering approach. The proposed pitch-synchronous SFF spectrogram produced accuracy values of 63.95% (unweighted) and 70.4% (weighted). These correspond to improvements of +7.35% (unweighted) and +4.3% (weighted) over the state-of-the-art result on the STFT spectrogram using CNN. Notably, the proposed method recognized 22.7% of the happy-emotion samples, whereas this number was 0% for the STFT spectrogram. These results promise a much wider use of the proposed pitch-synchronous SFF spectrogram for other speech-based applications.




I Introduction

Speech emotion recognition (SER) refers to the classification/recognition of a person's emotional state using the speech signal. SER has many real-life applications. It can be beneficial for applications where natural human-computer interaction is required. In computer tutorial applications, the detected emotion of the user can help the system in responding to the user's query [11]. It can be incorporated in the onboard system of a car for initiating safety measures depending on the passenger's mental state [33]. Medical professionals can use it as a diagnostic tool in psychological treatment [13]. In automatic translation systems, it can help in effectively conveying the emotions between two parties [2].

For recognizing emotions from speech, we need to extract emotion-specific features that are invariant with text [37]. After the introduction of deep neural networks, SER has gone through significant growth over the past few years. There are two ways of extracting features from speech: (i) hand-crafted features and (ii) deep neural network based features. Popular hand-crafted features are formant locations/bandwidths, pitch, voice probability, zero-crossing rate, harmonics-to-noise ratio, mel filter bank features, mel frequency cepstral coefficients (MFCCs), energy, and jitter [22, 24, 34, 1]. There are two disadvantages with hand-crafted features. First, there are inherent inaccuracies in the extraction of these features. Second, hand-crafted features tend to overlook the higher-level features that can be derived from the lower-level features. In deep learning, on the other hand, features are learned in a hierarchical manner, with higher-level abstractions learned from the low-level features.

Recently, convolutional neural networks (CNN) have become very popular for SER. A CNN is a data-driven feature extractor in which filters are learned from the data itself [28, 23, 35, 5, 6]. The short-time Fourier transform (STFT) spectrogram, a time-frequency representation of a signal, is widely used as the input to the CNN for speech applications [23, 32, 5, 12]. There are various existing works in SER [23, 32, 5, 12] in which the spectrogram was given as input to a CNN. In [23] the spectrogram was used as input to a CNN with a sparse auto-encoder to find salient features for SER. In [5] experiments were conducted using a pre-trained AlexNet model and a freshly trained CNN model. The pre-trained AlexNet model was fine-tuned for transfer learning. The experimental results showed that the approach based on the freshly trained CNN model was better than the pre-trained AlexNet model. In [12] experiments were conducted on several CNN and LSTM-RNN architectures. The CNN architectures achieved better performance than the LSTM-RNN architectures.

Reference [32] highlights the drawback of the spectrogram representation for noisy data. The authors modified the spectrogram using pitch information for robust SER. They also compared mel-scale spectrograms with linear-spaced spectrograms: mel-scale spectrograms remove the pitch information, but emotions are strongly correlated with pitch. Therefore, the linear-spaced spectrogram is usually used for speech emotion recognition. In [40] spectrogram and phoneme embedding features are combined with a CNN model to retain the emotional contents of speech. Both spectrograms and hand-crafted features have been used as input to the CNN for SER. However, as noted in [25], the raw spectrogram yields higher accuracy than the hand-crafted features because the latter are already decorrelated.

The speech signal is non-stationary, i.e., its behaviour changes with time. It is processed frame-wise, where the size of a frame is small enough (typically around 25 ms) to assume that the characteristics of speech are stationary within a frame. In a spectrogram, the Fourier transform of a short-time windowed signal is taken with some overlap between successive frames (typically a shift of around 10 ms). A shorter window (wide-band spectrogram) gives high temporal resolution but low spectral resolution, whereas a longer window (narrow-band spectrogram) gives high spectral resolution but low temporal resolution [7]. In other words, the STFT spectrogram is unable to obtain high resolutions of time and frequency simultaneously. Since the pattern of emotional speech varies rapidly within a glottal cycle due to the opening and closing of the glottis [17, 38], it requires high temporal resolution for further analysis.
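The wide-band vs. narrow-band trade-off can be illustrated by computing two magnitude STFTs of the same signal with different window lengths (a minimal numpy sketch; the signal, window lengths, and hop are illustrative, not the settings used in the paper):

```python
import numpy as np

def stft_mag(x, win_len, hop):
    """Magnitude STFT with a Hamming window; rows = frames, columns = frequency bins."""
    w = np.hamming(win_len)
    frames = [x[i:i + win_len] * w
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 16000
t = np.arange(fs) / fs                      # 1 s of dummy audio
x = np.sin(2 * np.pi * 440 * t)

wide = stft_mag(x, win_len=80, hop=40)      # 5 ms window: fine time, coarse frequency
narrow = stft_mag(x, win_len=400, hop=40)   # 25 ms window: coarse time, fine frequency
print(wide.shape, narrow.shape)
```

The shorter window yields more frames but fewer frequency bins per frame; the longer window does the opposite, which is exactly the resolution trade-off described above.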

The wavelet transform [16] is another time-frequency representation that overcomes the limitation of the STFT spectrogram to some extent. Here, the time resolution is better in the high frequency region, while the frequency resolution is better in the low frequency region. However, in the wavelet transform, the type of wavelet has to be chosen a priori, which often leads to misinterpretation of the data. On the other hand, the newly proposed single frequency filtering (SFF) spectrogram [3, 4, 29, 17] has both high temporal and spectral resolution without any a priori choices. Further, in the absence of sensitive parameters like window size (as in the case of the STFT), SFF is more robust. The SFF technique has been successfully used in speech/non-speech detection [3], fundamental frequency extraction [4, 29] and glottal closure instant (GCI) detection [17]. To the best of our knowledge, SFF has not been used for SER.

The SFF spectrogram is especially appealing for SER because emotional speech is more non-stationary than normal speech. The SFF output has both high temporal and spectral resolution. Hence, the SFF spectrogram captures the transient parts of emotional speech very well. The high spectral resolution clearly represents harmonic structures. The harmonic patterns, called the timbre spectrum, are important for emotion classification [20, 14]. However, there is a practical problem in using the SFF spectrogram for SER. SFF extracts the amplitude envelope at each sampling instant. This leads to a huge feature matrix, making it computationally prohibitive as input to a CNN. Hence, a logical way is required to reduce the size of the feature matrix of the SFF spectrogram.

Based on the above discussion, there are two primary motivations of our work:

  • Use of SFF Spectrogram for SER. The advantages offered by the SFF spectrogram over the STFT spectrogram and the wavelet transform technique make a strong case for exploring SFF spectrogram for SER.

  • Reduction of the size of the feature matrix of SFF. A logical way is required to reduce the size of the feature matrix of the SFF spectrogram to make its use computationally practical.

In light of the above motivations, the contributions of our work are as follows:

  • Pitch-synchronous SFF spectrogram: We have proposed a novel pitch-synchronous SFF spectrogram, where the amplitude is averaged over pitch cycles. As compared to block processing, where the assumption that emotional speech is stationary over fixed-size frames does not hold in practice, the proposed approach logically decomposes frames based on GCI locations; emotional speech is nearly stationary between two successive GCI locations. As a result, the unique pitch pattern of every emotion is reflected well in the pitch-synchronous SFF spectrogram (details in Sec. IV). To overcome the problem of a huge feature matrix, we average the amplitudes over pitch cycles, hence reducing the size of the feature matrix. This also motivated us to name the modified SFF spectrogram the pitch-synchronous SFF spectrogram.

  • Improved performance: We have compared pitch-synchronous SFF spectrogram with (i) STFT spectrogram and (ii) SFF spectrogram with 20 ms fixed sized frames (called “SFF-20 ms spectrogram”). All representations were evaluated on a developed CNN model. The SFF-20 ms spectrogram achieves an improvement of 2.49% for unweighted accuracy and 1.74% for weighted accuracy over state-of-the-art STFT spectrogram using CNN [32]. Pitch-synchronous SFF spectrogram further improves upon this by achieving improvements of 7.35% for unweighted accuracy and 4.3% for weighted accuracy over state-of-the-art STFT spectrogram using CNN [32].

The rest of the paper is structured as follows. In Section II, the proposed pitch-synchronous SFF spectrogram and CNN architecture are discussed in detail. Section III describes the experimental setup. Results are discussed in Section IV. Section V concludes the paper with future directions.

II Proposed Approach

Our proposed work utilizes the SFF spectrogram [3, 4, 29, 17], a time-frequency representation, for SER. The SFF spectrogram derives the amplitude envelope of a signal at each frequency as a function of time. In this way, the SFF spectrogram captures the temporal variation at each sample of an emotional speech signal. Note that this variation can also be captured by computing the discrete Fourier transform (DFT) over a block of data at every sampling instant, but this is computationally expensive since the DFT is performed at each sample. The SFF spectrogram extracts the amplitude envelope at each sampling instant, resulting in a huge feature matrix. Computationally, it is impractical to use such a huge feature matrix as input to a CNN. To overcome this, the amplitude envelope is averaged over all the samples between two successive GCI locations (i.e., over a pitch period). The resultant representation is called the pitch-synchronous SFF spectrogram. In Sec. II-A, the detailed steps of constructing the pitch-synchronous SFF spectrogram are described. In Sec. II-B, glottal closure instant (GCI) detection using the zero frequency filtering (ZFF) method is described. In Sec. II-C, the proposed deep CNN architecture is described.

II-A Pitch-synchronous SFF spectrogram

Following are the steps to obtain the SFF spectrogram output.

  1. The speech signal s[n], sampled at frequency fs Hz, is pre-emphasized (x[n] = s[n] - s[n-1]) to remove the low frequency bias from the speech signal.

  2. The pre-emphasized speech signal x[n] is multiplied by a complex sinusoid of normalized shifted frequency w_k as follows:

     x_k[n] = x[n] e^(j w_k n)

     x_k[n] is the resultant output at sample n and w_k is the normalized version of the frequency f_k at the k-th filter. w_k is computed as:

     w_k = 2 pi f'_k / fs

     f'_k is the frequency f_k shifted as f'_k = fs/2 - f_k.

     In the above, n = 0, 1, ..., N-1 and k = 1, 2, ..., K, where N is the total number of samples and K is the total number of filters. If the spacing between the filters is df Hz, the total number of filters is K = fs / (2 df).

  3. The resultant output x_k[n] is passed through a single-pole filter at fs/2 whose transfer function is given by

     H(z) = 1 / (1 + r z^(-1))

     The stability of the filter is ensured by choosing the pole of the filter inside the unit circle (radius r < 1). The output of SFF gives high resolution at each frequency because it considers the effect of one resonator at a time and significantly reduces the effect of the other resonators.

  4. The output of the filter is given by

     y_k[n] = -r y_k[n-1] + x_k[n]

     y_k[n] is a complex number with real part y_kr[n] and imaginary part y_ki[n].

  5. The SFF envelope, denoted v_k[n], of the filtered output at the k-th filter is given by

     v_k[n] = sqrt( y_kr[n]^2 + y_ki[n]^2 )

     In our work, the SFF envelope is calculated over a frequency range covering human speech, with a fixed spacing in Hz. The SFF envelope of an anger utterance is shown in Fig. 1.

  6. The GCI locations from the speech signal are detected using the ZFF algorithm [26] described in Sec. II-B. Let z[n] denote the ZFF signal and G = {g_1, g_2, ...} denote the set of GCI locations extracted from z[n].

  7. Subsampling: It refers to the process of reducing the number of samples by averaging the samples between two successive GCI locations in the SFF envelope. The average values are computed as:

     v'_k[i] = ( sum over n = g_i, ..., g_(i+1)-1 of v_k[n] ) / ( g_(i+1) - g_i )

     where g_i is the i-th GCI location and n iterates over the samples between the successive GCI locations g_i and g_(i+1). The resultant output is called the pitch-synchronous SFF envelope.

     Fig. 1: SFF time-frequency representation. The SFF spectrogram of an angry utterance (b) is shown in (a).
  8. Finally, the log-spectrum of the pitch-synchronous SFF envelope is computed as follows:

     S_k[i] = log( v'_k[i] )
The flow diagram of the proposed pitch-synchronous SFF spectrogram is shown in Fig. 2.
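The steps above can be sketched as follows (a minimal numpy illustration, not the paper's MATLAB implementation; the sampling rate, filter radius, analysis frequencies, random test signal, and evenly spaced "GCI" locations are illustrative assumptions):

```python
import numpy as np

def sff_envelope(s, fs, freqs, r=0.995):
    """SFF amplitude envelope v_k[n] at every sampling instant (steps 1-5)."""
    x = np.append(s[0], s[1:] - s[:-1])            # pre-emphasis x[n] = s[n] - s[n-1]
    n = np.arange(len(x))
    env = np.empty((len(freqs), len(x)))
    for k, f in enumerate(freqs):
        f_shift = fs / 2 - f                        # shift frequency f to fs/2
        xk = x * np.exp(2j * np.pi * f_shift * n / fs)
        yk = np.zeros(len(x), dtype=complex)
        for i in range(len(x)):                     # single-pole filter y[n] = -r y[n-1] + x[n]
            yk[i] = (-r * yk[i - 1] if i else 0) + xk[i]
        env[k] = np.abs(yk)                         # envelope = magnitude of complex output
    return env

def pitch_sync_log_sff(env, gci):
    """Average the envelope between successive GCIs (step 7), then take the log (step 8)."""
    avg = np.stack([env[:, g0:g1].mean(axis=1)
                    for g0, g1 in zip(gci[:-1], gci[1:])], axis=1)
    return np.log(avg + 1e-12)

fs = 8000
rng = np.random.default_rng(0)
s = rng.standard_normal(fs // 10)                  # 100 ms dummy "speech" signal
gci = np.arange(0, len(s) + 1, 80)                 # hypothetical GCIs every 10 ms
ps_sff = pitch_sync_log_sff(sff_envelope(s, fs, freqs=[500, 1000, 1500]), gci)
print(ps_sff.shape)                                # (frequencies, pitch cycles)
```

The subsampling step is what shrinks the per-sample envelope (one column per sample) down to one column per pitch cycle.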

Fig. 2: The block diagram of the pitch-synchronous SFF method. s[n] is the speech signal, which is pre-emphasized to give x[n]. The pre-emphasized signal is given to both the ZFF and SFF blocks. The output of the ZFF block is the ZFF signal z[n], from which the GCI locations are extracted. The output of the SFF block is y_k[n], from which the amplitude envelope v_k[n] is derived. The SFF envelope is subsampled by averaging the samples between two consecutive GCI locations. The log of the resultant pitch-synchronous SFF envelope gives the final output S_k[i].

II-B GCI detection using the Zero Frequency Filtering (ZFF) method

The GCI locations are detected by the ZFF method proposed in [26]. Following are the steps to detect the GCIs using the ZFF method.

  1. The speech signal s[n] is differentiated to pre-emphasize the high frequency components as:

     x[n] = s[n] - s[n-1]

     The resultant signal x[n] is known as the pre-emphasized signal.

  2. The pre-emphasized signal is passed through a cascade of two zero frequency resonators. The response of the cascaded resonators is defined as:

     y[n] = 4 y[n-1] - 6 y[n-2] + 4 y[n-3] - y[n-4] + x[n]

     where y[n] is the resonator output.

  3. For removing the trend in the resonator output, a moving average filter is used whose window length corresponds to the pitch period of that utterance. The resulting output is called the ZFF signal:

     z[n] = y[n] - ( sum over m = -M, ..., M of y[n+m] ) / ( 2M + 1 )

     where 2M + 1 corresponds to the number of samples used for computing the trend.

  4. The positive zero crossings of the ZFF signal z[n] correspond to the epoch (GCI) locations and are denoted by G = {g_1, g_2, ...}.
Fig. 3: GCI detection using the ZFF method. (a) Voiced segment. (b) Corresponding DEGG signal. (c) Zero frequency filtered signal. (d) Epoch locations.

Figure 3(a) shows the voiced segment of a sentence. Fig. 3(b) shows the differentiated electro-glottograph (DEGG) signal of the voiced segment of 3(a). Figure 3(c) shows the zero frequency filtered (ZFF) signal and Fig. 3(d) shows the positive zero crossings of the ZFF signal, which correspond to the epoch locations.
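The ZFF steps above can be sketched as follows (a minimal numpy illustration with a single trend-removal pass; the sampling rate, assumed pitch period, and random test signal are illustrative):

```python
import numpy as np

def zff_gci(s, fs, pitch_period_ms=10.0):
    """Detect GCIs via zero frequency filtering (steps 1-4)."""
    x = np.append(s[0], s[1:] - s[:-1])              # step 1: differenced speech signal
    y = np.zeros(len(x))                             # step 2: cascade of two
    for n in range(len(x)):                          #         zero-frequency resonators
        y[n] = x[n]
        if n >= 1: y[n] += 4 * y[n - 1]
        if n >= 2: y[n] -= 6 * y[n - 2]
        if n >= 3: y[n] += 4 * y[n - 3]
        if n >= 4: y[n] -= y[n - 4]
    M = int(fs * pitch_period_ms / 1000) // 2        # step 3: remove trend with a moving
    kernel = np.ones(2 * M + 1) / (2 * M + 1)        #         average of one pitch period
    z = y - np.convolve(y, kernel, mode='same')      # ZFF signal
    return np.where((z[:-1] < 0) & (z[1:] >= 0))[0] + 1   # step 4: positive zero crossings

fs = 8000
rng = np.random.default_rng(0)
s = rng.standard_normal(fs // 4)                     # 250 ms dummy signal
gci = zff_gci(s, fs)
print(gci[:5])
```

The resonator output grows polynomially with time, so the moving-average trend removal is essential before looking for zero crossings; published ZFF implementations often apply it more than once.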

II-C Deep CNN Architecture

Deep learning helps in learning multiple levels of abstraction by composing networks of multiple layers. The network learns features in a hierarchical manner: low-level features are learned from the raw data and higher-level features are learned from the lower-level features. Thus, deep learning overcomes the dependency of shallow networks on hand-crafted features by learning the features itself. In general, a convolutional neural network (CNN) is composed of convolution layers and pooling layers. A convolution layer learns the filters for low-level features by itself, whereas traditionally these filters were designed by human experts. Due to this advantage of automatically learning the features, raw inputs are directly fed to a CNN and it extracts the features from the input. The max-pooling layers extract dominant features that are robust against distortion and noise. Additional convolution and max-pooling layers are capable of deriving more complex features from the intermediate features. A CNN is composed of repetitive units called CNN blocks. In this section, the architecture of the CNN block used in our work and the learning procedures are discussed.

II-C1 CNN block

The CNN block used in our work consists of four components: one convolution layer, one batch normalization (BN) layer, one rectified linear unit (ReLU) layer [27], and one max-pooling layer, in that order, as shown in Fig. 4. The core layers of the CNN block are the convolution layer and the pooling layer. The BN layer helps in improving the stability and performance of deep neural networks. It normalizes the activations over each batch, maintaining zero mean and unit variance. The ReLU layer uses the ReLU activation function, which is widely used because it can lead to higher recognition accuracies and faster convergence rates [27]. The max-pooling layer applies a max-filter to sub-regions of the output of the convolution layer [9]. Depending upon the requirements, these CNN blocks can be configured accordingly.

In a convolution layer, each index value of the input X, which is a 2D matrix, is convolved with the convolution kernels. The input to the first layer is the raw spectrogram. The input to the convolution layers in the subsequent CNN blocks is the output of the previous CNN block. The convolution operation is performed between the convolution kernel and the input data to produce the feature maps. The convolution kernel k of size m x n is initialized randomly. When the input is convolved with the convolution kernel, the resultant output is obtained as:

Y_j^l = sum over i of ( X_i^(l-1) * k_ij^l )

The resultant output is given as input to the BN layer. The features that are learned by a convolution layer are normalized by the corresponding BN layer. The output of the BN layer is given to the corresponding ReLU layer as follows:

X_j^l = f( BN( Y_j^l ) )

X_j^l represents the j-th output feature at the l-th layer,
X_i^(l-1) represents the i-th input feature at the (l-1)-th layer, and
k_ij^l denotes the convolution kernel between the i-th input and j-th output feature.

The ReLU activation function f(.) of the network is defined as follows:

f(a) = max(0, a)

where a denotes the activation potential.

The next layer is the max-pooling or downsampling layer. The normalized features are passed to this layer. The resolution of the features is reduced by taking the maximum value of a sub-region in the pooling layer. The output of a max-pooling layer can be defined as:

P_j^l = max over (p, q) in R of X_j^l(p, q)

P_j^l represents the output features,
X_j^l represents the input features, and
R depicts the pooling patch.
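One forward pass through the four components of the block can be sketched on a toy input (a minimal numpy illustration, not the Keras implementation used in the paper; the input size, kernel, and pooling size are illustrative):

```python
import numpy as np

def conv2d_valid(X, K):
    """'Valid' 2-D convolution (cross-correlation, as in CNN libraries)."""
    m, n = K.shape
    H, W = X.shape[0] - m + 1, X.shape[1] - n + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(X[i:i + m, j:j + n] * K)
    return out

def batch_norm(Y, eps=1e-5):
    return (Y - Y.mean()) / np.sqrt(Y.var() + eps)   # zero mean, unit variance

def relu(Y):
    return np.maximum(0, Y)                          # f(a) = max(0, a)

def max_pool(Y, p):
    H, W = Y.shape[0] // p, Y.shape[1] // p
    return Y[:H * p, :W * p].reshape(H, p, W, p).max(axis=(1, 3))

# one CNN block: convolution -> BN -> ReLU -> max pooling
X = np.random.default_rng(0).standard_normal((32, 32))   # toy input "spectrogram"
K = np.random.default_rng(1).standard_normal((3, 3))     # randomly initialized kernel
out = max_pool(relu(batch_norm(conv2d_valid(X, K))), p=2)
print(out.shape)
```

A 32 x 32 input with a 3 x 3 kernel gives a 30 x 30 feature map, which the 2 x 2 max pooling reduces to 15 x 15.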

Fig. 4: CNN Block

II-C2 Learning Procedures

Learning of the neural network is phrased as an optimization problem to minimize the loss between the target and the predicted output. Our work is formulated as a multi-class classification problem; therefore, the categorical cross-entropy loss [10] function is used. The cross-entropy loss uses probabilities between zero and one to measure the performance of the corresponding classification model. As the predicted probability diverges from the target value, the cross-entropy loss increases. A separate loss for each class label per observation is calculated. All these losses are summed and the combined loss for an observation o is obtained as follows:

L_o = - sum over c = 1, ..., C of ( y_oc log(p_oc) )

C is the number of classes,
y_oc is a binary indicator; it is one if c is the correct class of observation o and zero otherwise, and
p_oc is the predicted probability of observation o for class c.

The cross-entropy loss over a mini-batch is computed by taking the average:

L = (1/B) sum over o = 1, ..., B of L_o

where B is the number of samples in a batch.
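The mini-batch categorical cross-entropy can be sketched as follows (a minimal numpy illustration; the one-hot targets and predicted probabilities are illustrative):

```python
import numpy as np

def categorical_cross_entropy(Y_true, P_pred, eps=1e-12):
    """Mean cross-entropy over a mini-batch: -(1/B) sum_o sum_c y_oc log(p_oc)."""
    P = np.clip(P_pred, eps, 1.0)          # avoid log(0)
    return -np.mean(np.sum(Y_true * np.log(P), axis=1))

# two samples, four classes (one-hot targets)
Y = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
P_good = np.array([[0.97, 0.01, 0.01, 0.01],
                   [0.01, 0.01, 0.97, 0.01]])   # confident, correct predictions
P_bad = np.array([[0.25, 0.25, 0.25, 0.25],
                  [0.25, 0.25, 0.25, 0.25]])    # uniform (uninformative) predictions
loss_good = categorical_cross_entropy(Y, P_good)
loss_bad = categorical_cross_entropy(Y, P_bad)
print(loss_good, loss_bad)
```

As the predicted probability of the correct class diverges from one, the loss grows; the uniform prediction gives a loss of log 4 per sample here.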

For computing the gradients, the cost function is differentiated with respect to the model parameters and backpropagated to the prior layers using the backpropagation algorithm [21, 31]. After obtaining the gradients, in general, the gradient descent algorithm or one of its variants is used for updating the model parameters. In our work, the Adam optimizer [19, 30] has been used for loss optimization. Adam stands for adaptive moment estimation. Using Adam, an adaptive learning rate is computed for each parameter. In addition to computing an exponentially decaying weighted average of the past gradients m_t, similar to momentum, Adam also computes an exponentially decaying weighted average of the squared gradients v_t, where beta_1 and beta_2 control the decay rates of these moving averages, as shown below:

m_t = beta_1 m_(t-1) + (1 - beta_1) g_t
v_t = beta_2 v_(t-1) + (1 - beta_2) g_t^2

where g_t denotes the gradient value at step t. The moving averages m_t and v_t are initialized to zero, while beta_1 and beta_2 are close to one. This results in a bias of the moment estimates towards zero. To counteract this, a bias correction is applied as follows:

m^_t = m_t / (1 - beta_1^t)
v^_t = v_t / (1 - beta_2^t)

m^_t is the first moment estimate representing the mean of the gradients and
v^_t is the second moment estimate representing the uncentered variance of the gradients.

For updating the parameters, the final formula is given below:

theta_(t+1) = theta_t - eta m^_t / ( sqrt(v^_t) + epsilon )

where theta is the parameter and eta is the learning rate.
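One Adam update can be sketched as follows (a minimal numpy illustration on a toy quadratic objective; the learning rate and iteration count are illustrative, not the paper's training settings):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moment averages, bias correction, parameter step."""
    m = b1 * m + (1 - b1) * g              # first moment: mean of gradients
    v = b2 * v + (1 - b2) * g ** 2         # second moment: uncentered variance
    m_hat = m / (1 - b1 ** t)              # bias correction (m and v start at zero)
    v_hat = v / (1 - b2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# minimize f(theta) = theta^2 starting from theta = 3
theta, m, v = 3.0, 0.0, 0.0
for t in range(1, 3001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)   # gradient of theta^2 is 2*theta
print(round(theta, 3))
```

Because the effective step size is roughly eta regardless of the gradient magnitude, Adam makes steady progress even where plain gradient descent would crawl.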

The classifier used in this architecture is softmax. Softmax is a generalization of logistic regression for multi-class classification. The softmax classifier gives the probabilities for each class label. The softmax function is defined as:

P(y = c | a) = exp(w_c^T a) / sum over c' = 1, ..., C of exp(w_c'^T a)

where a is the activation in the penultimate layer (the input to softmax) and w_c is the weight between the penultimate layer and the softmax layer for class c.

The class label of each segmented utterance is predicted as follows:

c^ = argmax over c of P(y = c | a)

where P(y = c | a) is the probability for class c.
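The softmax classification step can be sketched as follows (a minimal numpy illustration; the activation vector and weight matrix are randomly generated stand-ins, not learned parameters):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over class scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)
a = rng.standard_normal(8)            # hypothetical penultimate-layer activation
W = rng.standard_normal((4, 8))       # hypothetical weights to the softmax layer (4 classes)
p = softmax(W @ a)                    # class probabilities P(y = c | a)
label = int(np.argmax(p))             # predicted class = argmax of the probabilities
print(p, label)
```

Subtracting the maximum score before exponentiation leaves the probabilities unchanged but prevents overflow for large scores.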

III Experimental Setup

Our proposed model has been evaluated on the IEMOCAP (interactive emotional dyadic motion capture) [8] dataset. The IEMOCAP dataset is a multi-modal dataset containing audio, video, text, and gesture information of conversations arranged in dyadic sessions. In this work, we have considered only the audio tracks of the IEMOCAP dataset. The IEMOCAP dataset is natural-like and improvised, reasonably resembling spontaneous real-life emotional speech.

The IEMOCAP dataset comprises five sessions. In each session, there are two different speakers, one male and one female, with no overlap of speakers across sessions. The conversation of one session is approximately five minutes long. The contents of the dataset are recorded in both scripted and spontaneous scenarios. In our study, we have considered four categorical (class) labelled emotions, namely angry, happy, sad and neutral, from the improvised sessions [8]. We have used the leave-one-speaker-out (LOSO) cross-validation strategy. In LOSO, four sessions are used for training. One speaker from the remaining fifth session is used for validation while the other speaker is used for testing. This step is repeated for all five sessions.

In this dataset, all utterances are of different lengths, varying from approximately one to twelve seconds. To deal with the variance in the lengths of the input utterances, the utterances are split into three-second segments. Except for the last segment, all segmented utterances are three seconds long. Each segment is assigned the label of its parent utterance. These segmented utterances are used for training and validation. During testing, to retrieve the label at the utterance level, the posterior probabilities of all the constituent segments are averaged. The class label with the highest average value is assigned as the corresponding utterance label:

c^ = argmax over c = 1, ..., C of ( (1/S) sum over s = 1, ..., S of P_s(c) )

where c^ is the predicted label of the utterance,
P_s(c) is the posterior probability of segment s for class c,
S is the number of segments, and
C is the number of classes, i.e., four.
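The utterance-level decision rule can be sketched as follows (a minimal numpy illustration with hypothetical segment posteriors):

```python
import numpy as np

# hypothetical posteriors for S = 3 segments of one utterance, C = 4 classes
P = np.array([[0.6, 0.2, 0.1, 0.1],    # segment 1
              [0.3, 0.4, 0.2, 0.1],    # segment 2
              [0.5, 0.1, 0.3, 0.1]])   # segment 3
avg = P.mean(axis=0)                   # average posterior per class over segments
utt_label = int(np.argmax(avg))        # class with the highest average wins
print(avg, utt_label)
```

Note that segment 2 alone would vote for class 1, but averaging over all segments makes class 0 the utterance-level prediction.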

The pitch-synchronous SFF spectrograms are computed as described in Sec. II-A. There are only two parameters that need to be set: the radius r of the filter and the spacing between the frequencies. The range of selected frequencies covers human speech. The GCI locations are identified as described in Sec. II-B. The maximum number of GCI locations found in our experiments in a three-second segment determines the number of columns of the feature matrix, while the number of rows is the number of frequency bins. The feature matrices of segments having fewer GCI locations are padded with zeros to reach this fixed size. We compared the pitch-synchronous SFF spectrogram with the STFT spectrogram and the SFF-20 ms spectrogram. For the STFT spectrogram, the speech signal is segmented into overlapping frames and a Hamming window is applied on each frame to compensate for the end effects of the frame. The discrete Fourier transform of each frame is calculated with a window length of 800 samples. In SFF-20 ms, the amplitudes of all the samples within each 20 ms frame are averaged, with overlapping frames.

We used a convolutional neural network (CNN) for evaluating our proposed pitch-synchronous SFF spectrogram. To ensure exactly the same experimental conditions, we have implemented the same CNN architecture as proposed in the state-of-the-art method [32], except that we have used batch normalization for each convolution layer. The detailed parameters of each CNN block are given in Table I. The network has the following structure: three CNN blocks (CNN Block1, CNN Block2, and CNN Block3), one fully connected layer, and the output layer with the softmax classifier. In CNN Block1, there are 16 convolution kernels of size 12 x 16. CNN Block2 consists of 24 kernels of size 8 x 12 and CNN Block3 has 32 kernels of size 5 x 7. The sizes of the max-pooling kernels of the first, second and third CNN blocks are 100 x 150, 50 x 75 and 25 x 37 respectively.

The output obtained from the convolution layers represents high-level features. The output of the last CNN block is passed through the fully connected layer to learn non-linear combinations of these features for SER. The pitch-synchronous SFF spectrogram and the STFT spectrogram were implemented in MATLAB. The model is implemented using Keras, a widely used deep learning library.

Name                  | Layer         | Kernel Size | Output Size
CNN Block1            | Convolution2D | 12 x 16     | 189 x 284 x 16
                      | Max Pooling2D | 100 x 150   | 90 x 135 x 16
CNN Block2            | Convolution2D | 8 x 12      | 83 x 124 x 24
                      | Max Pooling2D | 50 x 75     | 34 x 50 x 24
CNN Block3            | Convolution2D | 5 x 7       | 30 x 44 x 32
                      | Max Pooling2D | 25 x 37     | 6 x 8 x 32
Fully Connected Layer |               | 64          | 64
Output Layer          |               | 4           | 4
TABLE I: Parameters of each CNN block

IV Results and Discussion

The IEMOCAP dataset used in this work is significantly imbalanced. To cope with the imbalance, weights are assigned to each class during training such that the weight of a class is inversely proportional to the total number of samples in that class. The testing data is also imbalanced. Hence, we calculate unweighted as well as weighted accuracy. Unweighted accuracy is important for imbalanced data because it gives equal weight to each class; specifically, it is not biased against minority classes. Formally:

  • Overall accuracy or weighted accuracy (WA) is defined as the number of samples predicted correctly out of the total number of samples.

  • Average class accuracy or unweighted accuracy (UWA) is defined as the average accuracy of individual classes. The individual class accuracy is defined as the number of samples predicted correctly out of the total number of samples in that class.
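The two metrics can be computed from a confusion matrix as follows (a minimal numpy illustration with a hypothetical two-class confusion matrix):

```python
import numpy as np

def weighted_accuracy(cm):
    """WA: correctly predicted samples over all samples."""
    return np.trace(cm) / cm.sum()

def unweighted_accuracy(cm):
    """UWA: mean of the per-class recalls (equal weight to each class)."""
    return np.mean(np.diag(cm) / cm.sum(axis=1))

# hypothetical confusion matrix (rows = true class, columns = predicted class)
cm = np.array([[90, 10],    # majority class: 90% recall
               [ 5,  5]])   # minority class: 50% recall
print(weighted_accuracy(cm))    # 95/110
print(unweighted_accuracy(cm))  # (0.9 + 0.5) / 2
```

On imbalanced data the two diverge: WA is dominated by the majority class, while UWA exposes the poor minority-class recall.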

The comparison of the pitch-synchronous SFF spectrogram with the STFT spectrogram for the four considered emotions is shown in Fig. 5. All the spectrograms are drawn from the same sentence of the same speaker. Pitch-synchronous SFF is an adaptive framing approach in which the frame size is derived from the instantaneous pitch period in the corresponding utterance. It can be seen that the pitch-synchronous SFF spectrogram captures clearer harmonics than the STFT spectrogram. The pitch-synchronous SFF spectrogram also contains more fine-grained temporal information than the STFT spectrogram. These properties of the pitch-synchronous SFF spectrogram enable better discrimination between the anger and happy emotions, which are known to produce very similar characteristics with respect to existing features. The pitch-synchronous SFF spectrograms of the happy and anger emotions exhibit clearer energy and formant structures than that of the neutral emotion. Figure 5(c) and (d) show the STFT spectrograms and the pitch-synchronous SFF spectrograms of the neutral and sad emotions. The STFT spectrogram produces similar patterns for the neutral and happy emotions, i.e., the energy concentrations are the same across the entire frequency region, making it difficult to discriminate between these two emotions. However, the pitch patterns of the happy emotion are prominently different from those of the neutral emotion in the pitch-synchronous SFF spectrograms. This results in a superior discrimination of the happy and neutral emotions.

In the last layers of the CNN architecture, the relationships between the feature maps are abstracted and the meaningful timbre spectrum is obtained. The GCI locations are present in the voiced regions. A frame is constructed based on the GCI locations, which automatically reduces the effect of the silence regions. We can clearly see in all the pitch-synchronous SFF spectrograms that the effect of silence between words is highly reduced.

Fig. 5: The pitch-synchronous SFF and STFT spectrograms of the (a) anger, (b) happy, (c) neutral, and (d) sad emotions. In each of the subfigures, the top panel shows the corresponding pitch-synchronous SFF spectrogram while the bottom panel shows the corresponding STFT spectrogram.
Fig. 6: Emotion classification performance (%) using the STFT spectrogram, SFF-20 ms spectrogram and pitch-synchronous SFF spectrogram
Representation                             | UWA (%) | WA (%)
Proposed Pitch-Synchronous SFF Spectrogram | 63.95   | 70.4
SFF-20 ms Spectrogram                      | 59.09   | 67.84
Baseline STFT Spectrogram                  | 58.55   | 67.04
Satt et al. [32] STFT Spectrogram          | 56.6    | 66.1
TABLE II: Comparison of the weighted and unweighted accuracies of the proposed method with the baseline methods

Regularization techniques have been used to prevent overfitting. Regularization makes slight changes in the learning algorithms such that the model can generalize better, which improves the performance of the model on unseen data. This also helps in faster convergence. We have used multiple regularization techniques, namely batch normalization, dropout, LOSO cross validation, model selection, and early stopping. Batch normalization is used to normalize the convolution layer outputs to reduce the vanishing gradient problem in the activation layer. Batch normalization reduces the dependency of the designed network on weight initialization. It improves the gradient flow through the network and also adds some amount of regularization into the network since the empirical means and variances are calculated using samples from mini-batches. Therefore, batch normalization helps to increase the accuracy during testing.

Dropout has also been used as a regularization technique in this work. At each iteration, a random subset of nodes is selected and temporarily removed. Since a different subset of nodes is active at every iteration, the network produces different outputs, similar to an ensemble technique in machine learning that captures more randomness. LOSO cross-validation is also used to avoid overfitting of the model. Experimental results show that the learned network generalizes better with cross-validation. The model was saved on the basis of maximum validation accuracy. The early stopping criterion is used to stop training based on the value of the patience parameter. Patience is the number of epochs for which a model should continue training in spite of no improvement in the validation accuracy. The value of patience is chosen as five. In our experiments, several methods for reducing overfitting were used, but overfitting was not completely avoided.

We have compared the proposed pitch-synchronous SFF spectrogram model with the following:

  • State-of-the-art result on the STFT spectrogram using CNN as proposed by Satt, Rozenberg and Hoory [32].

  • The batch-normalized model of the state-of-the-art result [32] on the STFT spectrogram. We call this the baseline model because any improvement of the proposed model over it is purely due to the pitch-synchronous SFF technique.

  • SFF-20 ms spectrogram. This represents the use of the SFF spectrogram with the traditional block processing technique.

Table II shows the corresponding comparison results. The regularization techniques alone improved WA and UWA over the state-of-the-art [32]. There is a further, albeit small, improvement when using the SFF-20 ms spectrogram. The best results are obtained for the proposed pitch-synchronous SFF spectrogram, with accuracy values of 63.95% (UWA) and 70.4% (WA). These correspond to improvements of 7.35% (UWA) and 4.3% (WA) over the state-of-the-art STFT spectrogram [32]. Comparing the results for the proposed pitch-synchronous SFF spectrogram with those of the SFF-20 ms spectrogram, a significant portion of the improvement over the state-of-the-art STFT spectrogram is due to the logical windowing between two successive GCI locations. However, such logical windowing is not possible with the STFT spectrogram, because the STFT spectrogram does not provide information at each instant of time.
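The two metrics in Table II can be computed from predictions as follows. WA is the overall fraction of correct predictions (favouring frequent classes), while UWA is the mean of per-class recalls (treating all emotions equally). This is a generic sketch of the metrics, not the authors' evaluation script:

```python
import numpy as np

def weighted_unweighted_accuracy(y_true, y_pred, n_classes=4):
    """Return (WA, UWA). WA: overall fraction correct.
    UWA: mean per-class recall over the classes present in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = np.mean(y_true == y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c)
               for c in range(n_classes) if np.any(y_true == c)]
    uwa = np.mean(recalls)
    return wa, uwa

# Toy example where class 0 dominates, so WA exceeds UWA.
y_true = [0, 0, 0, 0, 1, 2, 3]
y_pred = [0, 0, 0, 0, 1, 0, 0]
wa, uwa = weighted_unweighted_accuracy(y_true, y_pred)
# wa == 5/7, uwa == 0.5
```

The gap between WA and UWA in such class-imbalanced settings is why both numbers are reported for IEMOCAP.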

The confusion matrices of the baseline STFT spectrogram and the pitch-synchronous SFF spectrogram are shown in Tables III and IV, respectively. The accuracies of the anger and sad emotions are satisfactory using the baseline STFT spectrogram, but it fails to categorize the happy emotion: none of the happy samples is classified correctly, with most being predicted as neutral or anger. The same is also true for the state-of-the-art results in [32] (the corresponding confusion matrix is not shown here). However, the pitch-synchronous SFF spectrogram categorizes 22.7% of the happy-emotion samples correctly. The accuracies of the happy and neutral emotions improve by 22.7 and 11.61 percentage points, respectively, in the case of the pitch-synchronous SFF spectrogram. This can be considered a big plus for the pitch-synchronous SFF spectrogram when compared to the STFT spectrogram. The accuracies of the sad and anger emotions are approximately the same for the STFT and pitch-synchronous SFF spectrograms.

Actual \ Predicted | Anger | Happy | Neutral | Sad
Anger | 91.67 | 0 | 0 | 8.33
Happy | 40.91 | 0 | 50 | 9.1
Neutral | 17.86 | 3.57 | 52.67 | 25.89
Sad | 0 | 0 | 10.12 | 89.87
TABLE III: Confusion matrix of the STFT spectrogram (values in percent; rows are actual emotions, columns are predicted emotions)
Actual \ Predicted | Anger | Happy | Neutral | Sad
Anger | 81.25 | 4.16 | 14.58 | 0
Happy | 27.27 | 22.7 | 45.45 | 4.55
Neutral | 6.25 | 9.82 | 64.28 | 19.64
Sad | 0 | 0 | 13.92 | 86.07
TABLE IV: Confusion matrix for the proposed pitch-synchronous SFF spectrogram (values in percent; rows are actual emotions, columns are predicted emotions)
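Row-percentage confusion matrices such as Tables III and IV can be produced by counting (actual, predicted) pairs and normalizing each row to sum to 100; a minimal sketch with a toy two-class example:

```python
import numpy as np

def confusion_matrix_percent(y_true, y_pred, n_classes=4):
    """Count (actual, predicted) pairs, then normalize each row to percent,
    so row i shows how samples of class i were distributed over predictions."""
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / np.maximum(row_sums, 1)  # avoid divide-by-zero

# Toy two-class example: class 0 split 50/50, class 1 mostly correct.
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0, 1]
cm = confusion_matrix_percent(y_true, y_pred, n_classes=2)
# cm == [[50., 50.], [25., 75.]]
```

Each diagonal entry of such a matrix is the per-class recall in percent, whose mean over classes gives the UWA reported in Table II.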

The class accuracies of each emotion for the STFT, SFF-20 ms, and pitch-synchronous SFF spectrograms are shown in Fig. 6. The SFF-20 ms spectrogram is computed from the SFF envelope. The framing procedure is the same as for the traditional STFT spectrogram: a 20 ms frame size with 50% overlap between successive frames is used, and the average of all the samples in each frame of an SFF envelope is computed. Using the SFF-20 ms spectrogram, an improvement of 2.49% (UWA) and 1.74% (WA) was observed compared to the state-of-the-art STFT spectrogram using CNN [32]. The corresponding improvement for the pitch-synchronous SFF spectrogram is 7.35% (UWA) and 4.3% (WA).
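The two windowing schemes, fixed 20 ms frames with 50% overlap versus averaging between successive GCI locations, can be sketched as below. The code operates on a single SFF envelope (in practice this would be applied per frequency channel), and the GCI indices shown are hypothetical:

```python
import numpy as np

def frame_average(envelope, fs=16000, frame_ms=20, overlap=0.5):
    """SFF-20 ms style: average the envelope over fixed frames
    (default 20 ms) with the given fractional overlap (default 50%)."""
    frame = int(fs * frame_ms / 1000)
    hop = int(frame * (1 - overlap))
    n = max(0, (len(envelope) - frame) // hop + 1)
    return np.array([envelope[i * hop:i * hop + frame].mean()
                     for i in range(n)])

def pitch_synchronous_average(envelope, gci):
    """Pitch-synchronous style: average the envelope between each pair
    of successive GCI sample indices."""
    return np.array([envelope[a:b].mean()
                     for a, b in zip(gci[:-1], gci[1:])])

env = np.arange(16000, dtype=float)  # dummy 1 s "envelope" at 16 kHz
gci = [0, 100, 250, 400]             # hypothetical GCI locations (samples)
f20 = frame_average(env)             # 99 fixed-frame averages
ps = pitch_synchronous_average(env, gci)  # one value per pitch period
```

The pitch-synchronous variant yields one value per pitch period, so the analysis window adapts to the speaker's instantaneous pitch instead of a fixed 20 ms block.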

V Conclusion and Future Work

This paper highlights the drawbacks of the existing STFT spectrogram as a feature representation for SER. We have proposed a novel pitch-synchronous SFF spectrogram for SER that overcomes these drawbacks. We have attempted to solve the problem of SER using deep convolutional neural networks. Our proposed architecture consists of three CNN blocks followed by a fully connected layer and an output layer for detecting four emotions (i.e., anger, happy, neutral, sad). On the IEMOCAP dataset, the proposed pitch-synchronous SFF spectrogram achieved an improvement of 4.3% and 7.35% in weighted and unweighted accuracy, respectively, over the state-of-the-art STFT spectrogram representation using CNN. Notably, the proposed pitch-synchronous SFF spectrogram recognized 22.7% of the happy-emotion samples correctly, whereas this number was 0% for the state-of-the-art STFT spectrogram representation using CNN.

The pitch-synchronous SFF spectrogram can be used in various other applications such as speaker identification [18, 36], speaker verification [39], and audio classification [41]. The SFF output exhibits high SNR values of speech in the time-frequency domain. Our future work is to explore these properties of SFF to develop speech emotion recognition that is robust against degradations due to noise.


Akshay Deepak has been awarded the Young Faculty Research Fellowship (YFRF) of the Visvesvaraya PhD Programme of the Ministry of Electronics & Information Technology, MeitY, Government of India. In this regard, he would like to acknowledge that this publication is an outcome of the R&D work undertaken in the project under the Visvesvaraya PhD Scheme of the Ministry of Electronics & Information Technology, Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia).


  • [1] Jamil Ahmad, Mustansar Fiaz, Soon-il Kwon, Maleerat Sodanil, Bay Vo, and Sung Wook Baik. Gender identification using mfcc for telephone applications-a comparative study. arXiv preprint arXiv:1601.01577, 2016.
  • [2] Masato Akagi, Xiao Han, Reda Elbarougy, Yasuhiro Hamada, and Junfeng Li. Toward affective speech-to-speech translation: Strategy for emotional speech recognition and synthesis in multiple languages. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific, pages 1–10. IEEE, 2014.
  • [3] G Aneeja and B Yegnanarayana. Single frequency filtering approach for discriminating speech and nonspeech. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 23(4):705–717, 2015.
  • [4] G Aneeja and B Yegnanarayana. Extraction of fundamental frequency from degraded speech using temporal envelopes at high snr frequencies. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(4):829–838, 2017.
  • [5] Abdul Malik Badshah, Jamil Ahmad, Nasir Rahim, and Sung Wook Baik. Speech emotion recognition from spectrograms with deep convolutional neural network. In Platform Technology and Service (PlatCon), 2017 International Conference on, pages 1–5. IEEE, 2017.
  • [6] Abdul Malik Badshah, Nasir Rahim, Noor Ullah, Jamil Ahmad, Khan Muhammad, Mi Young Lee, Soonil Kwon, and Sung Wook Baik. Deep features-based speech emotion recognition for smart affective services. Multimedia Tools and Applications, 78(5):5571–5589, 2019.
  • [7] Yegnanarayana Bayya and Dhananjaya N Gowda. Spectro-temporal analysis of speech signals using zero-time windowing and group delay function. Speech Communication, 55(6):782–795, 2013.
  • [8] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335, 2008.
  • [9] Dan Cireşan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745, 2012.
  • [10] Pieter-Tjerk De Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein. A tutorial on the cross-entropy method. Annals of operations research, 134(1):19–67, 2005.
  • [11] Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3):572–587, 2011.
  • [12] Haytham M Fayek, Margaret Lech, and Lawrence Cavedon. Evaluating deep learning architectures for speech emotion recognition. Neural Networks, 92:60–68, 2017.
  • [13] Daniel Joseph France, Richard G Shiavi, Stephen Silverman, Marilyn Silverman, and M Wilkes. Acoustical properties of speech as indicators of depression and suicidal risk. IEEE transactions on Biomedical Engineering, 47(7):829–837, 2000.
  • [14] Ling He, Margaret Lech, and Nicholas Allen. On the importance of glottal flow spectral energy for the recognition of emotions in speech. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
  • [15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [16] Shubha Kadambe and Gloria Faye Boudreaux-Bartels. Application of the wavelet transform for pitch detection of speech signals. IEEE transactions on Information Theory, 38(2):917–924, 1992.
  • [17] Sudarsana Reddy Kadiri and B Yegnanarayana. Epoch extraction from emotional speech using single frequency filtering approach. Speech Communication, 86:52–63, 2017.
  • [18] HB Kekre, Vaishali Kulkarni, Prashant Gaikar, and Nishant Gupta. Speaker identification using spectrograms of varying frame sizes. International Journal of Computer Applications, 50(20), 2012.
  • [19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [20] Gudrun Klasmeyer. The perceptual importance of selected voice quality parameters. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 1615–1618. IEEE, 1997.
  • [21] James Leonard and MA Kramer. Improvement of the backpropagation algorithm for training neural networks. Computers & Chemical Engineering, 14(3):337–341, 1990.
  • [22] Ming Li, Kyu J Han, and Shrikanth Narayanan. Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Computer Speech & Language, 27(1):151–167, 2013.
  • [23] Qirong Mao, Ming Dong, Zhengwei Huang, and Yongzhao Zhan. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, 16(8):2203–2213, 2014.
  • [24] Hugo Meinedo and Isabel Trancoso. Age and gender classification using fusion of acoustic and prosodic features. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
  • [25] Seyedmahdad Mirsamadi, Emad Barsoum, and Cha Zhang. Automatic speech emotion recognition using recurrent neural networks with local attention. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 2227–2231. IEEE, 2017.
  • [26] K Sri Rama Murty and B Yegnanarayana. Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(8):1602–1613, 2008.
  • [27] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • [28] Michael Neumann and Ngoc Thang Vu. Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech. arXiv preprint arXiv:1706.00612, 2017.
  • [29] Vishala Pannala, G Aneeja, Sudarsana Reddy Kadiri, and B Yegnanarayana. Robust estimation of fundamental frequency using single frequency filtering approach. In INTERSPEECH, pages 2155–2159, 2016.
  • [30] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. 2018.
  • [31] Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: The rprop algorithm. In Proceedings of the IEEE international conference on neural networks, volume 1993, pages 586–591. San Francisco, 1993.
  • [32] Aharon Satt, Shai Rozenberg, and Ron Hoory. Efficient emotion recognition from speech using deep learning on spectrograms. Proc. Interspeech 2017, pages 1089–1093, 2017.
  • [33] Björn Schuller, Gerhard Rigoll, and Manfred Lang. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–577. IEEE, 2004.
  • [34] Elizabeth Shriberg, Luciana Ferrer, Sachin Kajarekar, Anand Venkataraman, and Andreas Stolcke. Modeling prosodic feature sequences for speaker recognition. Speech Communication, 46(3-4):455–472, 2005.
  • [35] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A Nicolaou, Björn Schuller, and Stefanos Zafeiriou. Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5200–5204. IEEE, 2016.
  • [36] Jia-Ching Wang, Chien-Yao Wang, Yu-Hao Chin, Yu-Ting Liu, En-Ting Chen, and Pao-Chi Chang. Spectral-temporal receptive fields and mfcc balanced feature extraction for robust speaker recognition. Multimedia Tools and Applications, 76(3):4055–4068, 2017.
  • [37] Chenjian Wu, Chengwei Huang, and Hong Chen. Text-independent speech emotion recognition using frequency adaptive features. Multimedia Tools and Applications, 77(18):24353–24363, 2018.
  • [38] Jainath Yadav, Md Shah Fahad, and K Sreenivasa Rao. Epoch detection from emotional speech signal using zero time windowing. Speech Communication, 96:142–149, 2018.
  • [39] Tsuei-Chi Yeh and Wen-Yuan Chen. Method for identifying authorized users using a spectrogram and apparatus of the same, August 22 2002. US Patent App. 09/884,287.
  • [40] Promod Yenigalla, Abhay Kumar, Suraj Tripathi, Chirag Singh, Sibsambhu Kar, and Jithendra Vepa. Speech emotion recognition using spectrogram & phoneme embedding. In Interspeech, 2018.
  • [41] Yuni Zeng, Hua Mao, Dezhong Peng, and Zhang Yi. Spectrogram based multi-task audio classification. Multimedia Tools and Applications, 78(3):3705–3722, 2019.