Since their introduction in 2017, the VoxCeleb corpora [16, 5] have become widely used in speech research. The audio data, harvested from YouTube recordings of celebrities, covers a wide variation in speakers, speaking styles, and environments. This has facilitated the development of new methods to handle such a challenging domain. VoxCeleb has been used most heavily in speaker recognition studies [14, 8], but other examples include speaker diarization [12], universal speech representation [23], and visual avatar synthesis [28]. The availability of both video and audio data facilitates multimodal person authentication experiments, where speech and face are used jointly [4].
Nonetheless, speaker identity is only one of many potential paralinguistic attributes. Two other notable ones are gender and age. One interesting property of VoxCeleb is that it contains many within-speaker variations, including recordings of the same person at different ages. Speaker recognition systems can be improved by the use of additional metadata such as gender and age [13, 21]. Further, the studies in [10, 11] indicate a negative effect of certain age and gender groups on speaker recognition performance. Therefore, we expect age- and gender-aware speaker recognition systems to be more robust across multiple scenarios. Besides analyzing or enhancing speaker recognition, age and gender metadata have other use cases, such as constructing age and gender classifiers that can be integrated into downstream applications. Unfortunately, while gender metadata is included in the VoxCeleb collections, age information is unavailable. This makes it challenging to address the above tasks on the otherwise versatile VoxCeleb data.
Whether from speech only or from the face image, perceptual experiments have indicated that humans have some capability of estimating the ages of persons unknown to them [9]. How accurate are automatic systems in this task? How robust is automatic age estimation to different nuisance factors? Age estimation performance is typically evaluated as the mean absolute error (MAE) between estimated and actual ages. Estimating age from images with recent deep learning methods reaches MAEs in the range of 2.17 to 3.34 [3, 17, 18, 22]. A frontal, non-occluded face is the easiest case to estimate age from, and performance can degrade when parts of the face are occluded. With speech, the challenges include variable-length utterances along with various technical nuisance factors (e.g. noise and channel) that can mask age-related cues.
Early efforts in estimating age from speech were based on the NIST SRE 2008 and 2010 corpora, as age was included in their metadata. However, these corpora can hardly be considered to be of the ‘in the wild’ variety, as the participants knew that they were participating in a data collection effort and the topics were controlled by the organizers. It is also noteworthy that the subjects in the NIST SRE 2008 and 2010 corpora were mostly university students and therefore represent a limited age range. Estimating age from i-vectors yielded an MAE of 6.9 [2], and neural network back-ends yielded an MAE of 5.49 for male speakers and 6.35 for female speakers [24]. Estimating age using senone posterior based i-vectors then yielded an MAE of 4.7 [20].
Previous attempts at speaker age estimation used fairly controlled telephony data as the source material. We are interested in finding out how well age estimation can be performed on uncontrolled, found data. Thus, the contributions of our study are two-fold. First, we provide a method for retrieving age as additional metadata for the VoxCeleb corpora. It is based on careful cross-combination and verification of metadata from different celebrity databases, as illustrated in Fig. 1. We provide the cleaned metadata for a subset of VoxCeleb videos and, for reproducibility, all metadata and scripts used to produce the results in this paper are available at https://git.io/JYKN4. Second, we present initial age recognition baselines on this data.
2 VoxCeleb Metadata
2.1 VoxCeleb Corpus
Since the first official release at Interspeech 2017, the VoxCeleb 1 [16] dataset has been used in various tasks. The corpus consists of 153516 utterances from 1251 unique speakers, extracted from 22496 different YouTube videos. A year later, VoxCeleb 2 [5] was released with a larger number of speakers, videos and utterances: the dataset consists of 6112 celebrities, 150480 videos and more than 1 million recordings. Along with speaker identities, the original authors provided gender and nationality information for each person. Concerning metadata, our contributions include (i) independent validation of the gender metadata, and (ii) extraction of age metadata.
2.2 Metadata Extraction and Cross-Validation
Gender and age values of each video are obtained by combining and filtering different data sources. First, we used the YouTube API to query each original video ID for its title, description and upload date. Since some of the original videos no longer exist or are geo-restricted (because of our computational environment, the API credentials used are associated with an Italian account and all queries were made from a Finnish IP address), we were unable to collect these details for all videos. Nonetheless, of the videos’ metadata were successfully retrieved.
The next step was to retrieve personal information of all speakers in the corpus. We looked up each speaker’s name from multiple independent sources including Google Knowledge Graph (GKG), DBpedia and Wikidata. Even though gender and birth date are included in these three databases, there are disagreements among them, which reflects uncertainty in the metadata. For instance, there are celebrities with similar or identical names. As an example, Paolo Ruffini (1765–1822) was an Italian mathematician and philosopher, while another Paolo Ruffini (1978–) is a present-day Italian actor and presenter. Another source of uncertainty is added by the YouTube video descriptions, which may contain wrong information. An amusing example is the case of Conan O’Brien (a US television host; male) occasionally being mislabeled as Tarja Halonen (former Finnish president; female), due to a joke about their likeness.
While we have little means to control potential errors in the video title or description, we aim for precision and robustness in verifying the gender and birth date metadata. Therefore, we only accept records with unanimous consensus among the three sources and disregard the rest. Cross-validation among independent sources substantially reduces labeling errors. Out of unique persons, had consistent gender across all three sources ( male, female, and transgender female). Concerning the birth year, however, only speakers had consistent information between GKG, DBpedia and Wikidata.
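The unanimous-consensus rule can be illustrated with a minimal sketch. The source names, field names and values below are hypothetical placeholders and do not reflect the actual schema or API responses of GKG, DBpedia or Wikidata.

```python
from typing import Dict, List, Optional


def consensus(values: List[Optional[str]]) -> Optional[str]:
    """Return the shared value only if every source reports the same non-missing value."""
    if not values or any(v is None for v in values):
        return None
    return values[0] if len(set(values)) == 1 else None


# Hypothetical per-source records for one speaker (illustrative field names).
records: Dict[str, Dict[str, str]] = {
    "gkg": {"gender": "male", "birth_year": "1978"},
    "dbpedia": {"gender": "male", "birth_year": "1978"},
    "wikidata": {"gender": "male", "birth_year": "1765"},  # namesake confusion
}

gender = consensus([r.get("gender") for r in records.values()])
birth_year = consensus([r.get("birth_year") for r in records.values()])
print(gender, birth_year)  # male None
```

In this toy case the gender label is kept, while the birth year is discarded because one source disagrees.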
Since gender labels are already provided with the original VoxCeleb 1 and 2 corpora, we were also interested in comparing our independently derived labels with that information. Out of the speakers with extracted metadata, the gender labels in the original VoxCeleb metadata agreed in cases (99.2% level of agreement). Of the disagreed cases, males were re-labeled as females and females as males with our approach. The last disagreement is a transgender female (according to our sources), labeled as female in the original VoxCeleb metadata. In the remainder of this work, we treat our gender labels as the correct ones; our confidence relies on the consensus of the three independent sources.
The last step was to infer the age of each speaker. In an ideal scenario, this could be done by computing the difference between the recording date and the speaker’s date of birth, which would give the exact speaker age. Unfortunately, estimating the recording time of a video is challenging, since it could have been recorded at any time before the upload. For simplicity, we work in units of years instead of exact dates. Even if this discretization introduces potential errors (up to 2 years for some records), it is a reasonable trade-off between complexity and accuracy for our task. To this end, we looked for references to the upload year (the only date available) in the title and the description of the video. If the upload year was found in both fields, the hypothesized recording year of the video was set to that value. Even though the title and description are free-text fields populated by the video uploader, they are part of the YouTube ranking factors. Since they provide semantic context about the video for the users, it is likely that they contain related and correct summary information for the given video. Once the recording year is recovered, the speaker age can be trivially inferred as the difference between the recording year and the speaker’s birth year. For training, videos with the upload year referenced only in the title were also accepted as valid. This was deemed useful for increasing the number of training observations, as additional celebrities were added; given their lower reliability, however, these records were never part of the test set.
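The year-matching heuristic can be sketched as follows. This is only an illustration under stated assumptions: the function names, the regular expression and the `require_both` switch are hypothetical and are not taken from the released scripts.

```python
import re
from typing import Optional


def hypothesized_recording_year(title: str, description: str, upload_year: int,
                                require_both: bool = True) -> Optional[int]:
    """Accept the upload year as recording year if the free-text fields mention it.

    require_both=True mirrors the stricter rule (title and description);
    require_both=False mirrors the relaxed, training-only rule (title only).
    """
    # Match the year as a standalone number, not as part of a longer digit run.
    pattern = re.compile(rf"(?<!\d){upload_year}(?!\d)")
    in_title = bool(pattern.search(title))
    in_description = bool(pattern.search(description))
    if in_title and (in_description or not require_both):
        return upload_year
    return None


def speaker_age(recording_year: int, birth_year: int) -> int:
    """Age in whole years, ignoring the exact day and month."""
    return recording_year - birth_year


# Hypothetical video metadata.
year = hypothesized_recording_year("Interview 2014", "Recorded live in 2014.", 2014)
if year is not None:
    print(speaker_age(year, 1978))  # 36
```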
With this approach, we inferred age for speakers, and overall unique <YouTubeID, VoxcelebID, age> triplets. Note that each person can have multiple different age values specific to each video; the age distribution for these unique <Speaker, Age> pairs can be found in Table 1. For instance, Tom Cruise (born in 1962) was interviewed in 2005, 2008 and 2014. Thus, he appeared in (at least) three recordings when he was 43, 46 and 52 years old. For any speaker that occurred in multiple videos with different age entries, we randomly select a single age value for the speaker. This is a deliberate strategy to reduce over-representation of certain popular speakers in both training and testing data. In this study, we were not interested in modelling how a given, fixed person’s voice changes through time. In other words, we prevent scenarios where the model will always predict the same age for a given speaker, as we wish our predictors to focus only on population features.
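A minimal sketch of the single-age-per-speaker selection is given below. The speaker identifiers, ages and seed are hypothetical, and the released scripts may implement the selection differently.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def one_age_per_speaker(pairs: Iterable[Tuple[str, int]], seed: int = 0) -> Dict[str, int]:
    """Keep one randomly chosen age per speaker from (speaker, age) pairs."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    ages: Dict[str, Set[int]] = defaultdict(set)
    for speaker, age in pairs:
        ages[speaker].add(age)
    # Sort before choosing so the result does not depend on set ordering.
    return {speaker: rng.choice(sorted(values)) for speaker, values in ages.items()}


# Hypothetical (speaker, age) pairs; "spk_a" appears with three different ages.
pairs = [("spk_a", 43), ("spk_a", 46), ("spk_a", 52), ("spk_b", 30)]
selected = one_age_per_speaker(pairs)
print(selected)
```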
| Age interval | Number of speakers |
3.1 Feature processing
A number of different feature representations were evaluated for the age and gender recognition tasks. As suggested in [19], i-vectors [6] and x-vectors [25] contain a large amount of relevant speaker information, including gender and channel variations. Our first choice is therefore to use these recording-level features. Both the i- and x-vector extractors were trained using mel-frequency cepstral coefficient (MFCC) features for speaker recognition, rather than for age or gender recognition. The reason is that speaker IDs are available for a much larger collection than the age and gender labels explained above. The extraction was performed with ASVtorch [15], a Python wrapper built on top of Kaldi and PyTorch.
Additionally, for the age regression task, two different frame-level feature sets were used. The first consists of the logarithm of mel-frequency filter outputs; the second is the same but with an additional discrete cosine transform (DCT), i.e., MFCCs. The settings are exactly the same as for the i/x-vector models. Specifically, we use 25 ms frames, 30 mel filters and 24 MFCCs. The same processing is also applied to augmented recordings, obtained by mixing the MUSAN [26] dataset with the original tracks to get 4 additional tracks per utterance.
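The relation between the two frame-level feature sets can be sketched in a few lines. This is a bare illustration that assumes an unnormalized DCT-II; the actual Kaldi-based front-end applies further processing (e.g. its own normalization and liftering conventions) that is not reproduced here, and the mel energies are toy values.

```python
import math
from typing import List


def dct_ii(x: List[float], n_out: int) -> List[float]:
    """First n_out coefficients of the unnormalized DCT-II of x."""
    n = len(x)
    return [
        sum(x[j] * math.cos(math.pi / n * (j + 0.5) * k) for j in range(n))
        for k in range(n_out)
    ]


def mfcc_from_mel(mel_energies: List[float], n_mfcc: int = 24) -> List[float]:
    """Log-compress mel filter outputs, then decorrelate them with a DCT."""
    log_mel = [math.log(max(e, 1e-10)) for e in mel_energies]  # floor avoids log(0)
    return dct_ii(log_mel, n_mfcc)


# One toy frame with 30 mel filter outputs, reduced to 24 cepstral coefficients.
frame = [1.0 + 0.1 * j for j in range(30)]
mfcc = mfcc_from_mel(frame)
print(len(mfcc))  # 24
```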
Due to differing recording lengths, it was necessary to unify the length of the MFCC and log-mel power features. Non-speech frames were first excluded. We then fix the desired length on the temporal axis to $T$ frames (i.e. 2 seconds) and sample a random chunk of data spanning frames $[s, s+T)$, where the start index $s$ is drawn at random in the range $[0, N-T]$, with $N$ being the number of frames in the utterance. During training, $s$ is changed at each epoch in order to reduce the chance of overfitting. For the test set, however, the start index is picked before the training phase and is never changed, in order to guarantee the reproducibility of the experiments.
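The chunk sampling can be sketched as below, assuming that the start index is drawn uniformly from 0 to N − T, where N is the number of frames and T the chunk length. The chunk length of 200 frames, the utterance names and the seeds are hypothetical; the handling of utterances shorter than the chunk is also an assumption.

```python
import random
from typing import Dict, List


def sample_start(n_frames: int, chunk_len: int, rng: random.Random) -> int:
    """Draw a start index uniformly from 0 .. n_frames - chunk_len (inclusive)."""
    if n_frames < chunk_len:
        raise ValueError("utterance is shorter than the chunk length")
    return rng.randint(0, n_frames - chunk_len)


def crop(frames: List[int], start: int, chunk_len: int) -> List[int]:
    """Return the fixed-length chunk beginning at start."""
    return frames[start:start + chunk_len]


CHUNK = 200  # hypothetical chunk length in frames

# Test set: start indices are drawn once, before training, and then frozen.
test_lengths: Dict[str, int] = {"utt_a": 500, "utt_b": 350}
fixed_rng = random.Random(1234)
test_starts = {u: sample_start(n, CHUNK, fixed_rng) for u, n in test_lengths.items()}

# Training: a fresh start index is drawn at every epoch.
train_rng = random.Random(0)
utterance = list(range(500))  # stand-in for 500 feature frames
for epoch in range(3):
    start = sample_start(len(utterance), CHUNK, train_rng)
    chunk = crop(utterance, start, CHUNK)
```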
3.2 Predictive algorithms
In the gender recognition task, we evaluate both shallow and deep models to investigate the impact of model size while ensuring feasible training times. Two representative back-end models were used with the recording-level i- and x-vector features: logistic regression and feed-forward neural networks with 2 and 4 hidden dense layers, each having 512 neurons, followed by the output node.
For the age regression task, in turn, we consider neural network-based models, such as feed-forward neural networks with 2 hidden dense layers of 512 neurons each and convolutional neural networks (1-dimensional and 2-dimensional) with optional residual connections, as well as linear, LASSO and ridge regression.
LASSO fitting solves $\min_{\beta} \lVert y - \hat{y} \rVert_2^2 + \lambda \sum_{i=1}^{p} \lvert \beta_i \rvert$, where $y$ and $\hat{y}$ are the ground truth and prediction vectors, respectively, $\beta_i$ is the regression coefficient of the $i$-th feature, $p$ is the number of features and $\lambda$ is a regularization parameter that encourages sparsity of the parameters. Ridge regression, in turn, optimizes $\min_{\beta} \lVert y - \hat{y} \rVert_2^2 + \lambda \sum_{i=1}^{p} \beta_i^2$; the additional regularization forces the parameters to be small and hence less sensitive to changes in the input features, improving generalization.
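The different effects of the two penalties can be seen in a one-feature toy problem without an intercept. The sketch assumes the unscaled objectives "sum of squared errors plus λ|β|" (LASSO) and "sum of squared errors plus λβ²" (ridge); toolkits often scale the error term differently, so the λ values here are not comparable to those of any particular library. The data are invented for illustration.

```python
import math
from typing import Sequence


def _sums(x: Sequence[float], y: Sequence[float]):
    """Return (sum of x*y, sum of x*x)."""
    return sum(a * b for a, b in zip(x, y)), sum(a * a for a in x)


def ols_1d(x: Sequence[float], y: Sequence[float]) -> float:
    """Unregularized least-squares coefficient."""
    sxy, sxx = _sums(x, y)
    return sxy / sxx


def ridge_1d(x: Sequence[float], y: Sequence[float], lam: float) -> float:
    """Minimizer of sum((y - x*b)^2) + lam * b^2: shrinks b but never to exactly zero."""
    sxy, sxx = _sums(x, y)
    return sxy / (sxx + lam)


def lasso_1d(x: Sequence[float], y: Sequence[float], lam: float) -> float:
    """Minimizer of sum((y - x*b)^2) + lam * |b|: soft-thresholding, can hit exactly zero."""
    sxy, sxx = _sums(x, y)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(ols_1d(x, y))          # 2.0
print(ridge_1d(x, y, 14.0))  # 1.0
print(lasso_1d(x, y, 28.0))  # 1.0
print(lasso_1d(x, y, 56.0))  # 0.0
```

With a large enough λ, the LASSO coefficient becomes exactly zero (the feature is dropped), whereas the ridge coefficient only shrinks towards zero.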
With MFCC and power mel-spectrum inputs, most of our models have a single input and a single output. However, some of our CNNs have multiple inputs and multiple outputs. In particular, we consider both continuous (e.g. 25.0) and quantized (e.g. ‘age category 25-30’) targets for the age regression task. The model architecture follows the design of siamese neural networks, with shared parameters apart from the output layer. The key idea is that the quantization would balance the label distribution among age groups. Most of the models trained for the age regression task used the mean-squared error (MSE) as the loss function; the only exceptions are the multi-input multi-output models, where the loss is $\mathcal{L} = w_{\mathrm{MSE}} \, \mathrm{MSE} + w_{\mathrm{CE}} \, \mathrm{CE}$, where $\mathrm{CE}$ stands for cross-entropy. The weights $w_{\mathrm{MSE}}$ and $w_{\mathrm{CE}}$ were chosen empirically to balance the contribution of each loss term.
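A combined loss for a model with one continuous and one quantized age output can be sketched as a weighted sum of MSE and cross-entropy. The weight names, the 5-year bin width and all numeric values below are hypothetical; the weights actually used in the experiments are not given here.

```python
import math
from typing import List, Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error of the continuous age output."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(labels: Sequence[int], probs: Sequence[Sequence[float]]) -> float:
    """Mean negative log-probability assigned to the correct age category."""
    eps = 1e-12  # guards against log(0)
    return sum(-math.log(max(p[c], eps)) for c, p in zip(labels, probs)) / len(labels)


def quantize(age: float, width: int = 5) -> int:
    """Map a continuous age to a category index, e.g. 25.0 -> bin 5 (ages 25-30)."""
    return int(age // width)


def multi_output_loss(ages: Sequence[float], age_preds: Sequence[float],
                      bins: Sequence[int], bin_probs: Sequence[Sequence[float]],
                      w_reg: float = 1.0, w_cls: float = 1.0) -> float:
    """Weighted sum of the regression and classification terms."""
    return w_reg * mse(ages, age_preds) + w_cls * cross_entropy(bins, bin_probs)


# One toy example with two age categories and a classifier that is undecided.
loss = multi_output_loss([25.0], [27.0], [1], [[0.5, 0.5]], w_reg=1.0, w_cls=2.0)
print(round(loss, 3))  # 5.386
```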
4 Experimental Setup
4.1 Data split and cross-validation set-up
In order to avoid data leakage from test set to training set, several precautions were taken. First, for both tasks, the data was split between recordings where age or gender is available and the ones where these attributes are absent. This latter data, consisting of the whole VoxCeleb 1 corpus (1251 speakers), was used for training the x- and i-vector extractors.
The remaining part (with age and gender labels) is reserved for training and testing the gender and age recognizers. The data is subjected to a holdout split, with 60% and 40% of the data used for training and testing, respectively. Five-fold cross-validation is used in feature and model selection and for tuning hyperparameters. Importantly, no speakers are shared between training and validation folds, or between training and test data.
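A speaker-disjoint holdout with speaker-disjoint folds can be sketched as follows. The sketch assumes that the 60/40 split is made at the speaker level (the text does not state whether the percentages refer to speakers or recordings); the speaker identifiers and seeds are hypothetical.

```python
import random
from typing import Iterable, List, Set, Tuple


def speaker_disjoint_split(speakers: Iterable[str], test_fraction: float = 0.4,
                           seed: int = 0) -> Tuple[Set[str], Set[str]]:
    """Split unique speakers into disjoint train and test sets."""
    unique = sorted(set(speakers))
    random.Random(seed).shuffle(unique)
    n_test = round(len(unique) * test_fraction)
    return set(unique[n_test:]), set(unique[:n_test])


def speaker_folds(train_speakers: Iterable[str], k: int = 5, seed: int = 0) -> List[Set[str]]:
    """Partition training speakers into k disjoint validation folds."""
    ordered = sorted(train_speakers)
    random.Random(seed).shuffle(ordered)
    return [set(ordered[i::k]) for i in range(k)]


# Hypothetical speaker identifiers; recordings would be assigned by their speaker.
all_speakers = [f"id{n:05d}" for n in range(10)]
train_spk, test_spk = speaker_disjoint_split(all_speakers)
folds = speaker_folds(train_spk)
print(len(train_spk), len(test_spk), [len(f) for f in folds])
```

Each recording then follows its speaker into the training set, the test set or a validation fold, so no voice is seen on both sides of any split.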
For the gender recognition task, we ensure that male and female labels are equally represented at every step, to avoid imbalance and potential gender bias. In the age regression task, instead, train and test distributions are kept close, as the test set is obtained by randomly sampling speakers. During testing, for each step of cross-validation, the same number of recordings is used for each speaker to ensure that each speaker contributes equally to the metric of interest. The same principle is followed during training unless otherwise noted.
4.2 Evaluation metrics
For the gender recognition task, we evaluate models using the F1 score, a commonly used metric in binary classification tasks. The F1 score is preferred over accuracy to avoid favoring models biased towards the dominant class. It is computed as $F_1 = 2PR/(P+R)$, where $P = TP/(TP+FP)$ is the precision and $R = TP/(TP+FN)$ is the recall. Here, $TP$, $FP$, and $FN$ denote the number of true positives, false positives and false negatives, respectively. The convention adopted for our work represents males as the positive class and females as the negative class. This is an arbitrary convention and does not impact the reported F1 scores.
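For concreteness, the F1 computation with males as the positive class can be written out as below. The label strings and the toy predictions are hypothetical; returning 0 when a denominator is empty is a common convention assumed here, not something stated in the text.

```python
from typing import Sequence


def f1_score(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "male") -> float:
    """F1 = 2PR / (P + R), with precision P and recall R of the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = ["male", "male", "female", "female"]
y_pred = ["male", "female", "female", "female"]
print(round(f1_score(y_true, y_pred), 4))  # 0.6667
```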
Age regression models, in turn, are evaluated using MAE, defined as $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$, where $n$ is the number of test cases while $y_i$ and $\hat{y}_i$ denote the ground-truth and predicted age, respectively. No rounding was applied to model predictions before computing this score. During training, the mean F1 and mean MAE across folds were computed in order to identify the best model and training configuration.
4.3 Reference values for classification
Even if not common in age recognition studies, it is useful to compare classifier predictions against less informative systems, namely, guessing. To this end, we consider three different reference values for the MAE metric, each with different prior knowledge of the test data. The first approach replaces classifier predictions with random predictions drawn (with replacement) from the empirical distribution of test set ages. The process is repeated 100k times and yields an average MAE of years. The second approach is similar, but the predictions are drawn from a less informative uniform distribution instead, with the range set according to the minimum and maximum age. This yields an average MAE of years. Our final approach predicts a fixed value deterministically. In particular, using grid search, we found the optimal (lowest MAE) fixed age of 39 years, with the respective MAE of years.
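The three reference values can be reproduced in spirit with the sketch below. The ages are toy values rather than the VoxCeleb test set, the number of trials is reduced for speed, and the grid for the fixed-age search (all integer ages between the minimum and maximum) is an assumption.

```python
import random
from typing import Callable, Sequence, Tuple


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between true and predicted ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def random_reference_mae(ages: Sequence[int], draw: Callable[[random.Random], float],
                         trials: int, rng: random.Random) -> float:
    """Average MAE when every prediction is replaced by an independent random draw."""
    total = 0.0
    for _ in range(trials):
        total += mae(ages, [draw(rng) for _ in ages])
    return total / trials


def best_fixed_age(ages: Sequence[int]) -> Tuple[int, float]:
    """Grid search over integer ages for the constant prediction with the lowest MAE."""
    candidates = range(min(ages), max(ages) + 1)
    best = min(candidates, key=lambda c: mae(ages, [c] * len(ages)))
    return best, mae(ages, [best] * len(ages))


ages = [20, 30, 40, 50, 60]  # toy test-set ages
rng = random.Random(0)
empirical = random_reference_mae(ages, lambda r: r.choice(ages), 1000, rng)
uniform = random_reference_mae(ages, lambda r: r.uniform(min(ages), max(ages)), 1000, rng)
fixed_age, fixed_mae = best_fixed_age(ages)
print(fixed_age, fixed_mae)  # 40 12.0
```

The best constant prediction is a median of the ages, which is why the fixed-value reference is the hardest of the three for a model to beat.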
5.1 Gender recognition
All the gender recognition experiments were balanced in terms of class, and no data augmentation was used in training. As Table 2 indicates, both x- and i-vectors are highly informative of gender. This is consistent with the findings in [19]. The fact that a linear model (i.e. logistic regression) outperforms deeper networks indicates a linear relation between the embedding vectors and gender information. Overall, the difference between the two types of embedding is minor, and the performance is unlikely to be affected by channel variation, since the accuracy is above 95% both in our case and in [19].
| Model | i-vector F1-score | x-vector F1-score |
5.2 Age regression
Several different experiments were carried out in the more challenging age regression task. We begin by comparing linear regression, ridge and LASSO back-ends with both types of embedding features. The training and testing datasets are balanced in the number of utterances selected from each speaker. The results are reported in Table 3.
| Model | i-vector MAE | x-vector MAE |
Concerning the back-end, ridge regression yields the best MAE for x-vectors, while LASSO is the most promising model for i-vectors. We also tried a non-linear estimator (a deep neural network), which however resulted in overfitting and degraded performance on the validation set. Concerning the features, x-vectors yield lower MAEs in most cases.
| Model | i-vector MAE | x-vector MAE |
We carried out a similar experiment without data balancing to find out whether using all the available training data helps in learning a better representation. Table 4 indicates the robustness of i-vectors under an imbalanced age distribution. This is anticipated, since the x-vector extractor was trained using supervised information (i.e. speaker labels), which is directly connected to the age labels. As a result, the training procedure could incorporate unexpected inductive biases toward certain dominant age groups. The best overall result is now obtained with i-vector embeddings and the linear regression model.
| Model | MFCC MAE | Log Mel MAE |
| CNN 1-D Multi Output* | 9.510 | 13.326 |
Since neither i- nor x-vectors are specifically designed for age recognition, we attempted to learn a more informative model by combining frame-level features (here, MFCCs and log-mel spectrograms) with a CNN-based model. Deep learning has been demonstrated to be a powerful end-to-end framework that automatically extracts relevant information for various speech processing tasks [27, 1, 7]. Our best convolutional model has three convolutional layers of 30, 60 and 120 filters, respectively, and two dense layers of 256 and 128 nodes before the output node. Table 5 shows that the CNN with MFCCs obtains the same MAE as the best i-vector system reported. We suspect that the reason why frame-level features do not lead to better results is related to the amount of training data: while the age regressor (back-end) training set is the same for all the systems in Table 5, the i-vector extractor is trained using a large offline training set. This may help in reducing the bias due to the imbalanced age distribution. Although data augmentation was used for the end-to-end approaches, the augmented utterances still come from the same set of speakers, with biases from dominant age groups. This issue is intensified by the strong non-linearity of deep models, which could easily steer the model toward a sub-optimal region.
Finally, we contrast our best model (i-vector features with linear regression) with the three references explained in Section 4.3. The model predictions are substantially better, in terms of MAE, than those of the reference approaches, an indication that the model learns to predict ages. But as Figure 2 reveals, the results vary considerably across ages: the most accurate predictions are obtained for the age range with an MAE as low as in the range of 40 years.
We suspect the non-uniformity of the age distribution itself to be the primary reason for the non-uniform performance. The utterances of 40-year-old speakers outnumber the data available for the other ages, and model predictions are confined to a narrower range of years. This is somewhat surprising considering that humans are usually good at identifying ‘broad age ranges’, and young and old people are certainly not easily mistaken for middle-aged ones. Note also that prediction errors around the median age values are lower for the reference classifier as well: drawing a random number between the minimum and maximum ages yields ‘better’ predictions for the middle ages than near the maximum and minimum ages.
We have demonstrated how to extend the usage of the VoxCeleb dataset to additional tasks such as age and gender recognition. Our pilot experiments indicate high recognition performance in the gender recognition task: the F1 score of our best model was 0.9829. Age regression, however, turned out to be a much more complex task. The best achieved overall MAE was . While substantially lower than the baseline values obtained by random guessing or fixed age predictions, the error remains high. Considering the high variability of MAE across ages, future studies could focus on better data augmentation using self-supervised learning or a Bayesian approach that addresses the biases via an age distribution prior. We also release the metadata to be used for improving speech processing tasks using the VoxCeleb datasets.
This work has been partially sponsored by Academy of Finland (proj. no. 309629).
[1] Deep speech 2: end-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning, Volume 48, ICML’16, pp. 173–182.
[2] (2014-09) Speaker age estimation using i-vectors. Eng. Appl. Artif. Intell. 34 (C), pp. 99–108.
[3] (2017) Using ranking-cnn for age estimation. In CVPR, pp. 5183–5192.
[4] (2020) Multi-Modality Matters: A Performance Leap on VoxCeleb. In Proc. Interspeech 2020, pp. 2252–2256.
[5] (2018) VoxCeleb2: deep speaker recognition. In INTERSPEECH.
[6] (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
[7] (2020) AutoSpeech: Neural Architecture Search for Speaker Recognition. In Proc. Interspeech 2020, pp. 916–920.
[8] (2020) AutoSpeech: Neural Architecture Search for Speaker Recognition. In Proc. Interspeech 2020, pp. 916–920.
[9] (2011) Speaker age and vowel perception. Language and Speech 54 (1), pp. 99–121. PMID: 21524014.
[10] (2017) Acoustical and perceptual study of voice disguise by age modification in speaker verification. Speech Communication 95, pp. 1–15.
[11] (2019) On the limits of automatic speaker verification: explaining degraded recognizer scores through acoustic changes resulting from voice disguise. The Journal of the Acoustical Society of America 146 (1), pp. 693–704.
[12] (2020) End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors. In Proc. Interspeech 2020, pp. 269–273.
[13] (2017) Effects of gender information in text-independent and text-dependent speaker verification. In ICASSP.
[14] (2019) I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences. In Proc. Interspeech 2019, pp. 1497–1501.
[15] (2021) ASVtorch toolkit: speaker verification with deep neural networks. SoftwareX 14, pp. 100697.
[16] (2017) VoxCeleb: a large-scale speaker identification dataset. In INTERSPEECH.
[17] (2016) Ordinal regression with multiple output cnn for age estimation. In CVPR, pp. 4920–4928.
[18] Mean-variance loss for deep age estimation from a face. In CVPR, pp. 5285–5294.
[19] Probing the information encoded in x-vectors. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 726–733.
[20] (2016) Speaker age estimation on conversational telephone speech using senone posterior based i-vectors. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5040–5044.
[21] New cosine similarity scorings to implement gender-independent speaker verification. In Interspeech.
[22] (2018) Regression forests for age estimation. In CVPR, pp. 2304–2313.
[23] (2020) Towards Learning a Universal Non-Semantic Representation of Speech. In Proc. Interspeech 2020, pp. 140–144.
[24] (2015) Exploring ann back-ends for i-vector based speaker age estimation. In Proceedings of Interspeech 2015, pp. 3036–3040.
[25] (2018) X-vectors: robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
[26] (2015) MUSAN: A Music, Speech, and Noise Corpus. arXiv:1510.08484v1.
[27] (2016) Deep language: a comprehensive deep learning approach to end-to-end language recognition. In Odyssey 2016, pp. 109–116.
[28] Fast bi-layer neural synthesis of one-shot realistic head avatars. In European Conference on Computer Vision (ECCV).