
End-to-End Speaker Height and age estimation using Attention Mechanism with LSTM-RNN

01/13/2021
by Manav Kaushik, et al.

Automatic height and age estimation of speakers using acoustic features is widely used for purposes such as human-computer interaction and forensics. In this work, we propose a novel approach of using an attention mechanism to build an end-to-end architecture for height and age estimation. The attention mechanism is combined with a Long Short-Term Memory (LSTM) encoder which is able to capture long-term dependencies in the input acoustic features. We modify the conventionally used attention — which calculates context vectors as the sum of attention only across timeframes — by introducing a modified context vector which takes into account total attention across encoder units as well, giving us a new cross-attention mechanism. Apart from this, we also investigate a multi-task learning approach for jointly estimating speaker height and age. We train and test our model on the TIMIT corpus. Our model outperforms several approaches in the literature. We achieve a root mean square error (RMSE) of 6.92 cm and 6.34 cm for male and female heights respectively, and an RMSE of 7.85 years and 8.75 years for male and female ages respectively. By tracking the attention weights allocated to different phones, we find that Vowel phones are most important while Stop phones are least important for the estimation task.


1 Introduction

Speech is a unique physiological signal which not only contains information about the linguistic content (such as words, accent, language, etc.) but also conveys para-linguistic content (such as height, age, gender, emotions, etc.). This makes it possible to estimate physical parameters like the height and age of a speaker, which have a wide variety of real-world applications such as natural human-machine interaction, speaker profiling, and forensics [1, 2].

A typical approach for speaker characteristic estimation is to apply shallow learning techniques, such as linear regression [3] or support vector machines [4, 5, 6], on top of an utterance-level representation such as i-vectors [4, 6] or x-vectors [7]. Such approaches are not end-to-end since the utterance-level representation extractors are trained separately for speaker recognition tasks and are not optimized for height and age estimation.

In this work, we propose an end-to-end approach for speaker height and age estimation. The acoustic features are encoded using an LSTM network, which allows capturing long-term dependencies. The uniqueness and novelty of our technique come from two aspects. First, we propose to use an attention mechanism for the speaker characteristic estimation task. To the best of our knowledge, there is no work in the literature which studies the use of attention for speaker height and age estimation. Second, instead of performing attention only across speech frames, as done conventionally [8, 9], we perform attention across both speech frames and encoder units to obtain two context vectors, then combine them to generate a final context vector. We believe that the proposed attention, denoted as cross attention, captures more information than the conventional counterpart and hence can produce better performance.

By analyzing attention weights across speech frames, we find that the highest weights are assigned to Vowel phones while the lowest weights are assigned to Stop phones. Since speaker characteristics such as height and age are correlated with the length of the speaker's vocal tract [10, 11], this higher attention to Vowel phones may be attributed to the fact that these phones require the use of the glottal vocal tract, whereas the utterance of Stop phones does not.

Apart from attention, we study how passing in gender information as a feature helps the model to better estimate the height and age of a speaker.

2 Related Works

Most previous studies on height and age estimation use conventional approaches of applying shallow learning techniques. For instance, Williams et al. [3] combine Gaussian Mixture Model (GMM) and linear regression subsystems to estimate speaker height. Poorjam et al. [4] and Bahari et al. [12] predict speaker height and age by applying least-squares Support Vector Regression (SVR) on top of i-vectors. Mahmoodi et al. [5] use Support Vector Machines (SVMs), while Bocklet et al. [6] use GMM supervectors with an SVM for the age estimation task. Singh et al. [10] use a bag-of-words representation generated from short-term cepstral features and train a Random Forest regressor for age and height estimation. The issue with the above-mentioned approaches is that none of them are end-to-end modeling techniques and thus they are not specifically optimized for speaker physical parameter estimation such as height and age.

More recently, Ghahremani et al. [7] propose an end-to-end deep neural network (DNN) for age prediction, while Kalluri et al. [13] jointly predict both height and age of a speaker using a unified end-to-end DNN model which is initialized using a conventional system based on SVR trained with Gaussian Mixture Model-Universal Background Model (GMM-UBM) supervector features. Although both of these works employ an end-to-end architecture for their estimation tasks, they still rely heavily on conventional approaches.

The attention mechanism [14] has been successfully applied to various research topics such as neural machine translation, keyword spotting [9], and computer vision [15, 16, 8]. To the best of our knowledge, none of the past works in the literature have used attention for speaker physical parameter estimation. Our work is the first in the literature to demonstrate the potential of the attention mechanism in tracking the relational dependency and importance of different phones in an utterance for estimating speaker height and age.

3 Dataset Used

We use the TIMIT dataset [17] for all experiments in this study. TIMIT contains a total of 6300 unique utterances from 630 speakers distributed across 8 different dialect regions, with each speaker speaking ten different utterances. The ratio of male to female speakers is 2:1. Moreover, the dataset includes time-aligned orthographic, phonetic, and word transcriptions, which help us track phonetic attention.

The train-test split is given in the dataset: 461 speakers (326 male and 135 female) for training and validation, and 162 speakers (112 male and 56 female) for testing. The height of speakers ranges from 145 cm to 199 cm in the training data and from 153 cm to 204 cm in the testing data. Similarly, the age of speakers ranges from 21 to 76 years in the training data and from 22 to 68 years in the test data. There is no overlap of speakers between the training and test sets. Moreover, the duration of the utterances ranges from 1 to 6 s, with an average of about 2.5 s.

4 Methodologies

Our proposed framework for end-to-end speaker height and age estimation is shown in Fig. 1. First, data preprocessing is applied to the input audio to generate acoustic features such as filter bank energies and pitch. These features are then encoded by an LSTM network before being fed to an attention layer. Subsequently, the output of the attention layer, which is a vector, is transformed by a dense layer to make the height and age prediction. In the following subsections, we describe these steps in detail.

4.1 Data Preprocessing and Augmentation

Our input for each utterance is a two-dimensional matrix consisting of $T$ timeframes, with each timeframe consisting of 83 acoustic features (80 filter bank and 3 pitch features) extracted from windows of 25 ms with a 10 ms stride. We apply Cepstral Mean and Variance Normalization (CMVN) to these acoustic features. The resulting features are $X = \{x_1, x_2, \dots, x_T\}$, where each frame $x_t$ contains 83 features.

Since neural network models require a large amount of data to train properly, we use speed perturbation as an augmentation step to obtain the audio signals at 1.1x and 0.9x speeds as well. Apart from this, we use spectral augmentation (SpecAugment) [18] to enhance the robustness of our model by randomly masking bands along the feature and time axes, covering approximately 10-12% of the training data.
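
As an illustration, here is a minimal NumPy sketch of per-utterance CMVN and SpecAugment-style masking; the function names and mask sizes are our own assumptions rather than the paper's exact configuration, and speed perturbation (typically done at the waveform level, e.g. with sox) is omitted.

```python
import numpy as np

def cmvn(features):
    """Cepstral Mean and Variance Normalization per utterance.
    features: (T, 83) array of 80 filter-bank + 3 pitch features."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True) + 1e-8
    return (features - mean) / std

def spec_augment(features, max_feat_mask=8, max_time_mask=20):
    """SpecAugment-style masking [18]: zero one random band on the
    feature axis and one random span on the time axis. Mask sizes
    here are illustrative, not the paper's exact settings."""
    aug = features.copy()
    T, F = aug.shape
    f = np.random.randint(0, max_feat_mask + 1)       # feature-band width
    f0 = np.random.randint(0, F - f + 1)              # band start
    aug[:, f0:f0 + f] = 0.0
    t = np.random.randint(0, min(max_time_mask, T) + 1)  # time-span width
    t0 = np.random.randint(0, T - t + 1)              # span start
    aug[t0:t0 + t, :] = 0.0
    return aug
```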

Figure 1: The proposed framework for speaker height and age estimation

4.2 LSTM RNN Encoding

Since the LSTM has been shown to be efficient at capturing long temporal dependencies, we chose this architecture to encode the acoustic features in our study. Given a sequence of input features $X = \{x_1, \dots, x_T\}$, the LSTM network processes it frame by frame to generate the sequence of hidden states $H = \{h_1, \dots, h_T\}$, where each state has dimension $n$, i.e. the number of LSTM units.

Once the input features are encoded, the straightforward approach is to take the final hidden state $h_T$ as the utterance-level representation for height and age estimation. We denote this setting as "last hidden state" in Table 1. However, in practice, the LSTM tends to forget information when operated on longer sequences. Therefore, we propose to use an attention mechanism to address this problem, as presented in Section 4.3.
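
A minimal PyTorch sketch of this encoder and the "last hidden state" baseline, using the hyper-parameters given later in Section 5.1; note that `nn.LSTM` does not expose recurrent dropout directly, so this sketch (our own simplification) only applies dropout to the inputs.

```python
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    """Single-layer LSTM encoder (128 units, 20% dropout; Section 5.1)."""
    def __init__(self, input_dim=83, hidden_dim=128, dropout=0.2):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, T, 83) -> h: (batch, T, 128)
        h, (h_n, _) = self.lstm(self.dropout(x))
        # Return all hidden states (for attention) and the last state h_T
        # (the "last hidden state" baseline).
        return h, h_n[-1]
```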

4.3 Attention Mechanism

The attention mechanism was primarily introduced to help memorize long sentences in neural machine translation [14]. It generates a context vector as a weighted sum of the hidden states of the LSTM encoder over all timeframes. Since the attention mechanism has access to the entire input sequence, the problem of forgetting the initial parts of the sequence is solved. We use soft attention, as is typical in previous works [8, 9].

First, a scalar score $e_t$ is estimated for each LSTM hidden state $h_t$ as:

$$e_t = v^\top \tanh(W h_t + b) \qquad (1)$$

where $v$, $W$, and $b$ are learnable parameters. Then, the attention weights $\alpha_t$ are obtained by applying a softmax function on $e_t$, i.e.

$$\alpha_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)} \qquad (2)$$

Since we use a softmax function, $\alpha_t \in (0, 1)$ and $\sum_{t=1}^{T} \alpha_t = 1$. After this, we obtain a context vector $c$ as the weighted average across all timeframes of the LSTM outputs $h_t$:

$$c = \sum_{t=1}^{T} \alpha_t h_t \qquad (3)$$

As a result, $c$ has the same dimension as the hidden states, i.e. $n$.
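
To make Eqs. (1)-(3) concrete, here is a minimal PyTorch sketch of the soft attention layer, continuing the encoder sketch above; the module and parameter names are ours.

```python
class TimeAttention(nn.Module):
    """Soft attention over timeframes, Eqs. (1)-(3)."""
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim)   # W and b of Eq. (1)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h):
        # h: (batch, T, n)
        e = self.v(torch.tanh(self.W(h)))    # (batch, T, 1), Eq. (1)
        alpha = torch.softmax(e, dim=1)      # (batch, T, 1), Eq. (2)
        c = (alpha * h).sum(dim=1)           # (batch, n),    Eq. (3)
        return c, alpha
```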

Instead of considering $c$ as the final context vector (which is the conventional approach [8, 9]), we propose a cross-attention approach in which we further perform attention across all the LSTM units to generate another context vector, denoted $c'$, and concatenate the two to obtain the final context vector $\hat{c}$:

$$\hat{c} = [c \,;\, c'] \qquad (4)$$

Note that $c'$ has dimension $T$, hence $\hat{c}$ has dimension $n + T$. $\hat{c}$ is finally passed into a dense layer which makes the final prediction.
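
A sketch of the cross attention, reusing the TimeAttention module above on the transposed hidden-state matrix; it assumes utterances are padded to a fixed number of frames $T$ (that fixed-length assumption is ours, not stated in the paper).

```python
class CrossAttention(nn.Module):
    """Cross attention, Eq. (4): attention over timeframes gives c
    (dim n); the same scoring applied over the n encoder units of the
    transposed hidden-state matrix gives c' (dim T); the final context
    is their concatenation (dim n + T)."""
    def __init__(self, hidden_dim=128, num_frames=600):
        super().__init__()
        self.time_attn = TimeAttention(hidden_dim)   # attends over T frames
        self.unit_attn = TimeAttention(num_frames)   # attends over n units

    def forward(self, h):
        # h: (batch, T, n), padded to a fixed T = num_frames
        c, _ = self.time_attn(h)                          # (batch, n)
        c_prime, _ = self.unit_attn(h.transpose(1, 2))    # (batch, T)
        return torch.cat([c, c_prime], dim=-1)            # (batch, n + T)
```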

4.4 Dense Layers

For each of height and age, the estimate $\hat{y}$ is made as:

$$\hat{y} = w^\top \hat{c} \qquad (5)$$

where $w$ is a learnable vector of the same length as $\hat{c}$. We use the Mean Squared Error (MSE) as our loss function:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad (6)$$

where $y_i$ and $\hat{y}_i$ are the actual and predicted values respectively for utterance $i$, and $N$ is the total number of utterances.

We also study multi-task learning, which aims to estimate height and age at the same time. The training loss for our approach is calculated as:

$$\mathcal{L} = \lambda \, \mathcal{L}_{\text{height}} + (1 - \lambda) \, \mathcal{L}_{\text{age}} \qquad (7)$$

where $\lambda$ is a hyper-parameter optimized on the validation set.
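
As an illustration of Eqs. (5)-(7), a minimal PyTorch sketch of the prediction head and the multi-task loss; `lam` is our name for the paper's weighting hyper-parameter, and the module names are ours.

```python
import torch.nn.functional as F

class EstimationHead(nn.Module):
    """Dense prediction head, Eq. (5): a linear map from the final
    context vector c_hat to a scalar estimate (one head per task)."""
    def __init__(self, context_dim):
        super().__init__()
        self.w = nn.Linear(context_dim, 1)

    def forward(self, c_hat):
        # c_hat: (batch, n + T) -> (batch,)
        return self.w(c_hat).squeeze(-1)

def multitask_loss(pred_h, true_h, pred_a, true_a, lam=0.5):
    """Eqs. (6)-(7): per-task MSE combined with weight `lam`,
    to be tuned on the validation set."""
    return lam * F.mse_loss(pred_h, true_h) + (1 - lam) * F.mse_loss(pred_a, true_a)
```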

5 Experimental Results

5.1 Experimental Setup

The LSTM network consists of 1 hidden layer with 128 units. Moreover, we use a dropout regularization of 20% and a recurrent dropout of 20% to avoid overfitting. We also apply dropout regularization of 20% on the dense layer.

For the performance analysis of the models, we use the standard metrics of Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), which are defined as:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}, \qquad \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \qquad (8)$$

where $y_i$ and $\hat{y}_i$ are the actual and predicted values respectively of the $i$-th utterance and $N$ is the total number of utterances.
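
Equivalently, Eq. (8) in NumPy (a small sketch with our own function names):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over N utterances, Eq. (8)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error over N utterances, Eq. (8)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))
```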

5.2 Quantitative Results

In the subsequent experiments, we analyze the performance of different models from the literature and of the different variations of the attention mechanism that we propose (at an analysis window of 25 ms).

| Technique              | Multi-task? | Gender consideration | Gender | Height RMSE (cm) | Height MAE (cm) | Age RMSE (years) | Age MAE (years) |
|------------------------|-------------|----------------------|--------|------------------|-----------------|------------------|-----------------|
| Last hidden state      | Yes         | Not considered       | Male   | 8.23             | 6.86            | 9.24             | 7.03            |
|                        |             |                      | Female | 7.94             | 6.19            | 10.20            | 7.71            |
| Conventional-Attention | Yes         | Not considered       | Male   | 6.99             | 5.42            | 8.08             | 5.84            |
|                        |             |                      | Female | 6.62             | 5.36            | 9.08             | 6.24            |
| Cross-Attention        | No          | Not considered       | Male   | 6.98             | 5.38            | 8.16             | 5.97            |
|                        |             |                      | Female | 6.56             | 5.30            | 9.12             | 6.27            |
| Cross-Attention        | Yes         | Not considered       | Male   | 6.94             | 5.29            | 7.90             | 5.62            |
|                        |             |                      | Female | 6.40             | 5.15            | 8.87             | 6.16            |
| Cross-Attention        | Yes         | As a binary feature  | Male   | 6.92             | 5.24            | 7.85             | 5.62            |
|                        |             |                      | Female | 6.34             | 5.09            | 8.75             | 6.08            |
| Singh et al. [10]      | No          | Separate models      | Male   | 6.9              | 5.2             | 8.0              | 5.7             |
|                        |             |                      | Female | 6.3              | 5.1             | 8.8              | 6.1             |
| Kalluri et al. [13]    | Yes         | Separate models      | Male   | 6.85             | -               | 7.60             | -               |
|                        |             |                      | Female | 6.29             | -               | 8.63             | -               |
| Mporas et al. [19]     | No          | Separate models      | Male   | 6.8              | 5.3             | -                | -               |
|                        |             |                      | Female | 6.3              | 5.1             | -                | -               |
| Williams et al. [3]    | No          | Separate models      | Male   | -                | 5.37            | -                | -               |
|                        |             |                      | Female | -                | 5.49            | -                | -               |

Table 1: Comparison of the end-to-end framework with different settings and with the literature

We first compare the results of different attention techniques: the last hidden state (Section 4.2); conventional attention [8, 9]; and the proposed cross attention with single-task and multi-task learning. We then show the effect of combining gender information (a binary 0/1 value) with the acoustic features. Results are shown in Table 1.

We observe that the proposed cross attention outperforms the other techniques in all conditions. We also see that multi-task learning tends to enhance the generalization ability of the model and thus gives better results than single-task learning. Finally, adding gender information consistently improves the performance of our model. Compared with other works, our proposed model outperforms several approaches in the literature and is competitive with the state-of-the-art. The advantage of our approach is that it uses a single end-to-end model for both height and age estimation.

5.3 Analysis

We study which phones are important for age and height estimation by tracking the attention weights for each phone in the TIMIT data. Since TIMIT contains manual phone boundaries, we can infer the phone label for each timeframe of an utterance. We accumulate the weights across all utterances to obtain an average weight for each phone.
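
A sketch of this accumulation, assuming the per-frame phone labels have already been derived from TIMIT's manual boundaries; the data layout here is our assumption.

```python
from collections import defaultdict

def phone_attention(utterances):
    """Average attention weight per phone across utterances.

    `utterances` is assumed to be an iterable of (alpha, phones) pairs,
    where alpha is the (T,) attention weight vector of Eq. (2) and
    phones gives the TIMIT phone label for each timeframe."""
    totals, counts = defaultdict(float), defaultdict(int)
    for alpha, phones in utterances:
        for w, p in zip(alpha, phones):
            totals[p] += float(w)
            counts[p] += 1
    avg = {p: totals[p] / counts[p] for p in totals}
    # Sort from most to least attended phone.
    return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)
```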

A total of 60 different phones are used in the TIMIT dataset. A broader analysis of the attention weight distribution shows that the highest attention weights are assigned to Vowel phones (such as ay, aw, aa, ae, ao, eh, ey, etc., as shown in Fig. 2), which hold the largest amount of linguistic and para-linguistic information as they involve the use of the glottal vocal tract [10], while the lowest weights are assigned to Stop phones (such as d, b, p, k, etc.) and some of the Fricatives (such as s, sh, f, etc.), which do not involve the use of the glottal vocal tract.

Figure 2: The 10 most attended and the 10 least attended phones

6 Conclusions

We have proposed a cross-attention approach for the task of joint speaker height and age estimation. The proposed approach performs attention not only across timeframes but also across hidden units, which produces a more informative context vector. Experimental results on the TIMIT data show that our proposed approach combined with multi-task learning outperforms many models in the literature. Furthermore, by tracking attention weights across timeframes, we found that Vowel phones are the most important while Stop phones are the least attended by the soft attention model for speaker physical characteristic estimation.

References

  • [1] D.C. Tanner and M.E. Tanner, Forensic Aspects of Speech Patterns: Voice Prints, Speaker Profiling, Lie and Intoxication Detection, Lawyers & Judges Publishing Company, 2004.
  • [2] Björn Schuller, Stefan Steidl, Anton Batliner, Felix Burkhardt, Laurence Devillers, Christian Müller, and Shrikanth Narayanan, “Paralinguistics in speech and language - state-of-the-art and the challenge,” Computer Speech and Language, Special Issue on Paralinguistics in Naturalistic Speech and Language, 01 2013.
  • [3] Keri A. Williams and J. Hansen, “Speaker height estimation combining gmm and linear regression subsystems,” in Proc. of ICASSP, 2013, pp. 7552–7556.
  • [4] Amir Hossein Poorjam, Mohamad Hasan Bahari, Vasileios Vasilakakis, et al., “Height estimation from speech signals using i-vectors and least-squares support vector regression,” in Proc. of ICTSP. IEEE, 2015, pp. 1–5.
  • [5] Davood Mahmoodi, Hossein Marvi, Mehdi Taghizadeh, Ali Soleimani, Farbod Razzazi, and Marzieh Mahmoodi, “Age estimation based on speech features and support vector machine,” in Proc. of CEEC. IEEE, 2011, pp. 60–64.
  • [6] Tobias Bocklet, Andreas Maier, and Elmar Nöth, “Age determination of children in preschool and primary school age with gmm-based supervectors and support vector machines/regression,” in Proc. of International Conference on Text, Speech and Dialogue. Springer, 2008, pp. 253–260.
  • [7] Pegah Ghahremani, Phani Sankar Nidadavolu, Nanxin Chen, Jesús Villalba, Daniel Povey, Sanjeev Khudanpur, and Najim Dehak, “End-to-end deep neural network age estimation.,” in Proc. of INTERSPEECH, 2018, pp. 277–281.
  • [8] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. of ICML, 2015, pp. 2048–2057.
  • [9] Changhao Shan, Junbo Zhang, Yujun Wang, and Lei Xie, “Attention-based end-to-end models for small-footprint keyword spotting,” arXiv preprint arXiv:1803.10916, 2018.
  • [10] Rita Singh, Bhiksha Raj, and James Baker, “Short-term analysis for estimating physical parameters of speakers,” in Proc. of IWBF. IEEE, 2016, pp. 1–6.
  • [11] David RR Smith and Roy D Patterson, “The interaction of glottal-pulse rate and vocal-tract length in judgements of speaker size, sex, and age,” The Journal of the Acoustical Society of America, vol. 118, no. 5, pp. 3177–3186, 2005.
  • [12] Mohamad Hasan Bahari, ML McLaren, DA Van Leeuwen, et al., “Age estimation from telephone speech using i-vectors,” in Proc. of INTERSPEECH. Portland, USA, 2012.
  • [13] Shareef Babu Kalluri, Deepu Vijayasenan, and Sriram Ganapathy, “A deep neural network based end to end model for joint height and age estimation from short duration speech,” in Proc. of ICASSP. IEEE, 2019, pp. 6580–6584.
  • [14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [15] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al., “Recurrent models of visual attention,” in Advances in neural information processing systems, 2014, pp. 2204–2212.
  • [16] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
  • [17] John S. Garofolo, “TIMIT acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
  • [18] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
  • [19] Iosif Mporas and Todor Ganchev, “Estimation of unknown speaker’s height from speech,” International Journal of Speech Technology, vol. 12, no. 4, pp. 149–160, 2009.