Towards Learning a Universal Non-Semantic Representation of Speech

02/25/2020 ∙ by Joel Shor, et al. ∙ 0

The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained for different datasets or tasks. While significant progress has been made in the visual and language domains, the speech community has yet to identify a strategy with wide-reaching applicability across tasks. This paper describes a representation of speech based on an unsupervised triplet-loss objective, which exceeds state-of-the-art performance on a number of transfer learning tasks drawn from the non-semantic speech domain. The embedding is trained on a publicly available dataset, and it is tested on a variety of low-resource downstream tasks, including personalization tasks and tasks from the medical domain. The model will be publicly released.




1 Introduction

One of the most powerful uses of deep learning is finding a good representation for a given domain. Despite progress on representations in the visual domain (Zhai et al., 2019) and the language domain (Wang et al., 2018, 2019), no such universal representation exists for the speech domain. One reason is a lack of standard benchmark tasks to compare different methods; for example, existing speech representations tend to focus on one problem set such as speaker recognition or speech emotion recognition (Latif et al., 2020). In this paper, we propose a set of benchmark speech tasks that are diverse, to require that “good” representations contain general speech information, and targeted, to allow good performance when compared with task-specific representations.

We propose a specific set of publicly available tasks, called the “NOn-Semantic Speech benchmark” (NOSS), to assess the general usefulness of speech representations on “non-semantic” tasks. “Non-semantic” tasks do not include tasks like automatic speech recognition and phone classification, which require sub-second granularity. They do include paralinguistic tasks such as speech emotion recognition, as well as tasks such as speaker identification, language identification, and medical diagnosis. In addition, we introduce a new set of intra-speaker sub-tasks from existing tasks where a model is trained and evaluated on speech from a single speaker. These intra-speaker tasks help measure which representations are most useful for personalization, which is an increasingly relevant use-case as more computation is performed on-device.

A good speech representation should be high-performing on a diverse set of downstream tasks using simple models. In addition, it should be useful in transfer learning with small amounts of data for a new task. This use-case is relevant for model personalization, such as user-specific emotion recognition or speaker identification. If the model that generates the representation is small, it can be used for on-device learning. Transfer learning from a larger dataset is also useful in the medical domain, where data cannot be shared as easily due to patient privacy. A generally-useful representation for speech would benefit medical speech researchers who do not have access to large amounts of medical audio data.

There are a number of approaches to building speech representations, either using hand-crafted or learned features. OpenSmile (Eyben et al., 2010) is a very popular library that extracts non-learned, signal processing-based features from audio. It is the standard classical front-end for a wide range of non-semantic speech tasks. Previous attempts to learn a DNN-based representation have leveraged various techniques including supervised training, self-supervision, predictive coding, and multimodal coincidence.

We introduce a representation, TRILL (TRIpLet Loss network), which is learned in a self-supervised way on the speech portion of AudioSet (Gemmeke et al., 2017). Using the techniques of (Jansen et al., 2017), the network represents audio such that segments which are closer in time are also closer in the embedding space. We demonstrate that this simple proxy objective is highly effective in learning a strong representation for non-semantic speech.

We evaluate TRILL and other representations on our benchmark by training small models built on top of the representations and comparing their performances. In addition, we explore transfer learning by fine-tuning TRILL using data from the downstream tasks. This is an advantage of learned representations over non-learned ones. Transfer learning can sometimes outperform representations for the same model (Zhai et al., 2019), and this is also the case in our benchmark. Using transfer learning, we are able to achieve a new state-of-the-art in many of the tasks, surpassing previously published results which sometimes were hand-crafted for those specific datasets.

In summary, our contributions are:

  1. We define a new benchmark for comparing representations on non-semantic speech tasks using previously published data. In addition, we add a sub-category of personalization tasks.

  2. We demonstrate that a single representation learned in an unsupervised manner performs best on this benchmark. We compare it to existing representations, both feature-based and learned.

  3. We fine-tune our best performing representation, further boosting results. This method sets a new state-of-the-art on many previously published tasks from our benchmark.

  4. We distill our learned representation to a model that can run inference and training on-device, and we open-source the original and distilled models.

2 Background

Transfer learning and domain adaptation have been extensively studied in machine learning (Pan and Yang, 2009). Recent research has mostly focused on deep representation learning methods, either supervised, semi-supervised, or unsupervised. Successful representations improve the sample efficiency of ML algorithms by extracting most information out of the raw signal from the new data before any task-specific learning takes place. This strategy has been used successfully in many application domains (Tan et al., 2018; Cheplygina et al., 2019).

An important step in learning a good representation is having a standard benchmark to evaluate it. Such benchmarks should contain a variety of downstream tasks, representing different tasks in the domain. Such benchmarks have been developed in vision and NLP (Zhai et al., 2019; Wang et al., 2019).

There are three standard approaches to adapting a representation to multiple, potentially heterogeneous, downstream tasks. One approach is to train a task-specific linear classifier on the embeddings produced by a pre-trained network, whose parameters are kept frozen (Yosinski et al., 2014; Donahue et al., 2013). A second approach is to fully fine-tune (Cui et al., 2018), in which a pre-trained network is used as starting point for the end-to-end training process. Generally, fine-tuning matches or outperforms the performance of fully-supervised models trained on the downstream tasks (Zhai et al., 2019; Kong et al., 2019), especially when the amount of labeled data is small. A third approach is multi-task learning. This has been applied in the speech domain (Zhang et al., 2019; Pascual et al., 2019), although not on a wide range of tasks. It is usually favored when the downstream tasks are all applied on the same input set.

There are many methods for learning audio representations. (Kong et al., 2019) trained an embedding for audio classification on AudioSet which transferred well to non-speech downstream tasks. Other work has demonstrated the value of supervised (Hershey et al., 2016; Pascual et al., 2019), semi-supervised (Parthasarathy and Busso, 2018), or unsupervised representations. The unsupervised audio representation literature is especially diverse (L³ (Arandjelovic and Zisserman, 2017), AuDeep (Freitag et al., 2017), Autoregressive Predictive Coding (Chung et al., 2019; Chung and Glass, 2020), Contrastive Predictive Coding (van den Oord et al., 2018), metric learning (Jansen et al., 2017), autoencoding (Latif et al., 2018)). However, these methods were evaluated on just one or a limited set of downstream tasks.

In other domains, training a strong representation requires a very large, general dataset. AudioSet (Gemmeke et al., 2017) is the largest dataset for general purpose audio machine learning, serving as an audio equivalent of ImageNet. Even when restricted to only the samples with speech tags, it surpasses all datasets in size and variability. It has been used to learn a general purpose audio embedding in (Kong et al., 2019), and can be used for multiple speech tasks.

3 Non-Semantic Speech Benchmark (NOSS)

Dataset          Has intra-speaker  Target                   Number of  Number of  Number of  Average
                 task                                        classes    samples    speakers   duration (secs)
VoxCeleb         No                 Speaker identification   1,251      153,514    1,251      8.2
VoxForge         No                 Language identification  6          176,438    13,559     5.8
Speech Commands  Yes                Command                  12         105,829    2,618      1.0
CREMA-D          Yes                Emotion                  6          7,442      91         2.5
SAVEE            Yes                Emotion                  7          480        4          3.8
DementiaBank     No                 Dementia/healthy         2          210        210        70.0
Table 1: Datasets for downstream benchmark tasks.

To standardize the assessment of non-semantic speech representations, we introduce NOSS. This section describes the benchmark in detail (summarized in Table 1). These tasks reflect different properties of the speech signal, and they vary in size and difficulty. Personalization and on-device training are increasingly important, so we include “Intra Speaker” tasks for some of the datasets, when applicable. Intra-speaker tasks are an important addition because they also test task adaptation with small amounts of data, and they check that representations do not rely on speaker identity alone.

3.1 Inter-speaker tasks

3.1.1 VoxCeleb

VoxCeleb (Nagrani et al., 2017) is an audio-visual speaker recognition dataset from YouTube videos. We used VoxCeleb1, which contains 153,514 utterances for 1,251 celebrities.

3.1.2 VoxForge

VoxForge (MacLean, 2018) is a website containing user-submitted audio clips in various languages, and it can be used to create a language classification task. We follow the settings of (Revay and Teschke, 2019), which collected utterances from 6 languages - English, Spanish, French, German, Russian, and Italian, for a total of 176,438 utterances from 13,559 speakers. Note that VoxForge is constantly updating, and so these numbers are expected to change in the future.

3.1.3 Speech Commands

Speech Commands (Warden, 2018) differs from the other tasks in our benchmark in that it contains limited semantic information. However, it only tests for 12 different classes: 10 are full words, one is silence, and the last is “unknown”, which is a collection of 26 different words. We chose to include this dataset since it allows us to test a model’s understanding of a more rapidly changing phenomenon than other tasks in the NOSS benchmark (it has the smallest average duration at 1 second). This dataset contains approximately 100K utterances in total, recorded by 2,618 different speakers.

3.1.4 CREMA-D

CREMA-D (Cao et al., 2014) is an audio-visual emotion expression dataset. The dataset consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral), with 7,442 clips of 91 different actors. The best published result on this dataset uses both audio and visual information (Ghaleb et al., 2019), but our method only uses the audio modality.

3.1.5 SAVEE

SAVEE (Haq et al., 2009) is also an emotion recognition task. It contains 4 male actors reading sentences with 7 emotions: anger, disgust, fear, happiness, neutral, sadness and surprise. The database consists of 120 utterances per actor and 480 sentences in total. The data also contains audio-visual recordings, and as with CREMA-D we only use the audio modality. SAVEE is smaller than CREMA-D, making it more suitable for testing the representation in limited data settings.

3.1.6 DementiaBank

DementiaBank (Boller and Becker, 2005) is a medical domain task. It contains 117 people diagnosed with Alzheimer’s disease and 93 healthy people, each reading a description of an image, and the task is to classify these groups. It demonstrates the difficulties faced when trying to train models for the medical domain, namely a very small amount of data and some non-standard speech. Although it is a very noisy dataset, it is one of the few examples of an audio medical dataset in the public domain.

3.2 Intra-speaker tasks

An important use-case of task adaptation is personalization: training a model on data from a specific person and evaluating only on that person. The accuracy is averaged over all speakers. Note that most inter-speaker tasks divide the speakers into disjoint groups in the train/test split, whereas in the intra-speaker case all speakers are used for both training and testing. Not all tasks have a meaningful intra-speaker version; it is meaningless to train and test on the same speaker in tasks with labels that depend only on the speaker’s identity, such as language identification, medical diagnosis, and speaker identification. The intra-speaker tasks are: CREMA-D, SAVEE, and Speech Commands.
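The per-speaker averaging can be sketched as follows. This is a minimal illustration rather than the evaluation code used for the benchmark: the function name `intra_speaker_accuracy` and the input layout (a dict from speaker ID to a list of (true label, predicted label) pairs) are assumptions made for the example.

```python
def intra_speaker_accuracy(results_by_speaker):
    """Average of per-speaker accuracies.

    results_by_speaker maps a speaker ID to a list of
    (true_label, predicted_label) pairs from that speaker's test data.
    (Hypothetical input layout, chosen for this sketch.)
    """
    per_speaker = []
    for pairs in results_by_speaker.values():
        correct = sum(1 for true, pred in pairs if true == pred)
        per_speaker.append(correct / len(pairs))
    # Each speaker counts equally, however many clips they contributed.
    return sum(per_speaker) / len(per_speaker)


example = {
    "speaker_a": [("happy", "happy"), ("sad", "happy")],
    "speaker_b": [("happy", "happy")] * 4,
}
score = intra_speaker_accuracy(example)
```

In this toy example the per-speaker accuracies are 0.5 and 1.0, so the averaged score is 0.75, whereas pooling all six clips would give 5/6. Averaging over speakers weights every speaker equally regardless of how much data each one has.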

4 Triplet-based representation

Non-semantic aspects of the speech signal (e.g. speaker identity, language, and emotional state) generally change more slowly than the phonetic and lexical aspects used to explicitly convey meaning. Therefore, we expect a good representation for non-semantic downstream tasks to be considerably more stable in time than what is required for ASR applications. However, at sufficiently long time scales (e.g. across days or environments), we would expect the person talking and the context to change rather dramatically. Thus, we can expect rough temporal proximity of audio clips to be weakly predictive of geometric proximity of latent factors that characterize non-semantic content. To take advantage of this intuition, we follow the work of (Jansen et al., 2017) Section 3.2.5 and use temporal proximity as a self-supervised signal. However, instead of grouping audio segments closer than a fixed number of seconds, we cluster segments drawn from the same audio sample.

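More formally, consider a large, unlabeled speech collection represented as a sequence of spectrogram context windows $X = x_1 x_2 \ldots x_N$, where each $x_i \in \mathbb{R}^{F \times T}$. Our goal is to learn a map $g: \mathbb{R}^{F \times T} \rightarrow \mathbb{R}^d$ from spectrogram context windows to $d$-dimensional space such that $\|g(x_i) - g(x_j)\| \le \|g(x_i) - g(x_k)\|$ when $|i - j| \le |i - k|$. We can express this desired relationship as a learning objective using triplet loss-based metric learning as follows. First, we sample from $X$ a large collection of example triplets of the form $z = (x_i, x_j, x_k)$ (the so-called anchor, positive, and negative examples), where $|i - j| \le \tau$ and $|i - k| > \tau$ for some suitably chosen time scale $\tau$. The loss incurred by each triplet is then given by

$$\mathcal{L}(z) = \left[ \|g(x_i) - g(x_j)\|_2^2 - \|g(x_i) - g(x_k)\|_2^2 + \delta \right]_+$$

where $\|\cdot\|_2$ is the $L^2$ norm, $[\cdot]_+$ is standard hinge loss, and $\delta$ is a nonnegative margin hyperparameter.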
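As a concrete illustration of the per-triplet loss, the sketch below evaluates a hinge on the difference between the squared anchor-positive and anchor-negative distances plus a margin. It operates on already-computed embedding vectors given as plain Python lists; the function names and the margin values are illustrative assumptions, not values taken from the paper.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (in squared distance).
    The default margin is a placeholder for this sketch.
    """
    gap = squared_l2(anchor, positive) - squared_l2(anchor, negative)
    return max(0.0, gap + margin)


# Satisfied triplet: the positive is much closer than the negative.
easy = triplet_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])

# Violated triplet: the negative is closer to the anchor than the positive.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.5], margin=0.5)
```

The first triplet incurs no loss, while the second incurs a loss of 1.0 - 0.25 + 0.5 = 1.25, which is the kind of example that produces a useful gradient.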

The triplet loss objective is amenable to stochastic gradient descent optimization, but it is well-known that progress can quickly plateau if the triplet examples are not particularly difficult to satisfy. We thus employ the now-standard within-batch semi-hard negative mining technique (Schroff et al., 2015), which involves applying the current state of the embedding map $g$ to all triplets in a batch and reassigning negatives to anchor-positive pairs that will continue to incur loss penalty (i.e. negatives that are “hard”). However, choosing the hardest negative reassignment is subject to label noise, so the semi-hard strategy backs off to selecting the closest negative to the anchor that remains further than the positive.
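The selection rule in the last sentence can be sketched for a single anchor-positive pair as follows. This is an illustrative, unbatched version: the function name is hypothetical, and the fallback used when no candidate is farther than the positive (taking the farthest candidate) is an assumption of this sketch that the text does not specify.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pick_semi_hard_negative(anchor, positive, candidates):
    """Index of the closest candidate that is still farther than the positive.

    `candidates` are embeddings of the other examples in the batch.
    Assumed fallback: if every candidate is at least as close as the
    positive, return the farthest candidate instead.
    """
    d_ap = squared_l2(anchor, positive)
    dists = [squared_l2(anchor, c) for c in candidates]
    semi_hard = [i for i, d in enumerate(dists) if d > d_ap]
    if semi_hard:
        return min(semi_hard, key=lambda i: dists[i])
    return max(range(len(candidates)), key=lambda i: dists[i])


anchor, positive = [0.0, 0.0], [1.0, 0.0]          # d(anchor, positive) = 1.0
batch = [[0.5, 0.0], [2.0, 0.0], [0.0, 1.5]]       # distances 0.25, 4.0, 2.25
chosen = pick_semi_hard_negative(anchor, positive, batch)
```

Here the candidate at distance 0.25 is skipped because it is closer than the positive (a likely mislabeled "negative"), and of the remaining two the one at distance 2.25 is selected.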

In the present case, the temporal proximity-based supervisory signal is extremely weak with respect to particular downstream applications. It expresses only an expected property of non-semantic speech representations, but it is far from strictly true in a large uncurated speech collection and a fixed time scale $\tau$. Therefore, to still succeed in learning something generally useful we must rely on a very large data scale to boost the strength of the supervisory signal. AudioSet (Gemmeke et al., 2017), while collected for the development of large-scale audio event model training and evaluation, contains speech in over half of its over 5,000 hours of audio. This includes speech taken from hundreds of thousands of voices in nearly as many distinct natural contexts. Since each AudioSet clip is typically 10 seconds, we simply set $\tau$ to cover whole clips. This sampling strategy does not produce particularly difficult negatives, making the semi-hard mining technique critical for successful optimization.

Following (Hershey et al., 2016; Jansen et al., 2017), we (i) take as input log mel spectrogram context windows with $F = 64$ mel bands and $T = 96$ frames representing 0.96 s of input audio (STFT computed with 25 ms windows with step 10 ms); and (ii) employ the Resnetish (Hershey et al., 2016) variant of the standard ResNet-50 architecture followed by a $d = 512$ dimensional embedding layer. We length normalize each embedding before calculation of the triplet loss, which transforms squared Euclidean distance into cosine distance. Notably, batch normalization is not used due to the biased sampling involved in triplet construction (Ioffe, 2017). We use the Adam optimizer with learning rate of .
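The remark that length normalization turns squared Euclidean distance into cosine distance follows from the identity ‖u − v‖² = 2 − 2 u·v for unit vectors. The short check below verifies this numerically; the helper names and the example vectors are arbitrary choices for illustration, and the vectors are assumed to be nonzero.

```python
import math


def l2_normalize(v):
    """Scale a nonzero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u, v = [3.0, 4.0], [1.0, 2.0]
lhs = squared_l2(l2_normalize(u), l2_normalize(v))   # distance after normalization
rhs = 2.0 * (1.0 - cosine_similarity(u, v))          # twice the cosine distance
```

The two quantities agree up to floating-point error, so ranking triplets by squared distance between normalized embeddings is equivalent to ranking them by cosine distance.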

Finally, note that the average pooling operation present in the ResNet architecture before the final fully connected layer destroys the sub-second temporal structure of our learned representation, which may be suboptimal for some downstream tasks. Therefore, we also consider representations defined by earlier convolutional blocks after the full Resnetish embedding model has been fully trained. These internal layers produce 3-tensor outputs (time × frequency × channels) that in our experiments are flattened.

5 Experiments

Representation                   VoxCeleb1  VoxForge  Speech Commands  CREMA-D  SAVEE   DementiaBank
Prev SOTA                        80.5 [1]   89 [2]    91.1 [3]         74 [4]   61 [5]  80.6 [6]
Mel / MFCC                       15.5       79.2      47.7             52.8     55.6    67.7
OpenSmile                        2.3        78.0      36.5             53.7     62.6    68.8
Random Network                   12.0       73.0      42.0             52.1     48.6    67.9
YAMNet top                       13.7       67.0      40.7             52.2     45.4    64.8
YAMNet layer 10                  41.5       86.1      73.1             66.4     62.3    70.0
VGGish top                       14.1       80.8      28.3             51.3     49.8    68.3
VGGish FCN 1                     33.8       85.1      52.7             55.7     57.7    68.7
TRILL top                        33.9       83.8      60.4             64.9     53.7    68.2
TRILL layer 19                   48.9       88.1      74.0             67.8     67.8    67.2
TRILL layer 19, MobileNet 2048d  44.6       83.4      74.9             68.1     60.0    68.1
TRILL finetuned                  44.6       94.1      91.2             69.5     68.6    73.1
[1] (Nagrani et al., 2017)  [2] (Revay and Teschke, 2019)  [3] (Kaggle, )
[4] (Ghaleb et al., 2019), uses audio-visual features  [5] (Haq et al., 2009)  [6] (Noorian et al., 2017), uses textual features
Table 2: Average performance of the different embeddings on a number of downstream tasks.

5.1 Representations

To test our new representation, we compare it to a number of existing methods on the benchmark.

5.1.1 Log-magnitude mel spectrograms and MFCCs

The most common representations in the speech domain are log-magnitude mel spectrograms or features derived from them (Davis and Mermelstein, 1980). They are inspired by auditory and physiological findings of how humans perceive speech signals, and have been successfully used in speech research in the past decades. Even today, they serve as the first choice for some classification tasks because of their simplicity and usefulness. They are derived from a Short Time Fourier Transform of the audio samples. MFCCs are a compact representation derived from the log-magnitude mel spectrogram by an additional discrete cosine transform.
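The last step, going from one log-magnitude mel spectrogram frame to MFCCs, can be sketched in plain Python. The sketch uses an unnormalized DCT-II and keeps the first few coefficients; scaling conventions and the number of retained coefficients vary between toolkits, so both are assumptions here, as are the function names.

```python
import math


def dct_ii(x):
    """Unnormalized type-II discrete cosine transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi / n * (t + 0.5) * k) for t in range(n))
        for k in range(n)
    ]


def mfcc_from_log_mel(log_mel_frame, num_coeffs):
    """Keep the first `num_coeffs` DCT coefficients of one log-mel frame."""
    return dct_ii(log_mel_frame)[:num_coeffs]


# A perfectly flat log-mel frame has all its energy in coefficient 0.
flat_frame = [1.0, 1.0, 1.0, 1.0]
coeffs = mfcc_from_log_mel(flat_frame, num_coeffs=3)
```

Truncating the DCT output is what makes the representation compact: low-order coefficients describe the coarse spectral envelope, and higher-order detail is dropped.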

In our experiments, we treat Mel and MFCC differently than the other embeddings. Instead of using one configuration and testing it on all of the datasets, we test many different configurations (number of bins, and using either Mel or MFCC or a combination), and report on each dataset the results with the best configuration. In all our configurations we use the same STFT parameters as described in Section 4.

5.1.2 OpenSmile features

OpenSmile (Eyben et al., 2010) is the de-facto state of the art for feature extraction (Cummins and Schuller, 2019). Specifically, it is still more popular than learned representations in the areas of emotion recognition (Latif et al., 2020) and many medical tasks. It is a flexible library capable of extracting many kinds of features including spectral and prosodic ones. We use the ComParE16 acoustic parameter set (Schuller et al., 2016) in our evaluation, which consists of 6,373 features resulting from various functionals over low-level descriptors.

5.1.3 Randomly initialized networks

Randomly initialized networks have recently been shown to produce good embeddings (Ulyanov et al., 2018), due to their compression power. In (Tian et al., 2019; Michelashvili and Wolf, 2019; Zhang et al., 2020), this technique has been demonstrated in audio processing as well, including music classification, speech separation and ASR. In our benchmark we add a comparison to our own network with random initialization, training a simple model (as described in Section 5.2.1) on top of it.

5.1.4 Learned representations

We also evaluate two existing open-source models trained for audio event classification. YAMNet (Plakal and Ellis, 2020) is an implementation of MobileNet trained on AudioSet for audio classification. In (Kong et al., 2019), the authors report that this network achieves results comparable to the state of the art in general non-speech audio classification. We chose to compare to it because of its simplicity and its lightweight implementation, both desired features of such a representation. In addition, it is the only learned representation we are aware of which was trained on AudioSet. VGGish (Hershey et al., 2016) is an audio embedding produced by training a modified VGGNet model to predict video-level tags from the YouTube-8M dataset. As in the case of YAMNet, it is also a simple embedding trained on a very large and diverse corpus, but it is less focused on audio events. Both of these representations use the same 0.96 s input context window as the TRILL network and the same input log-magnitude mel spectrogram features.

5.2 Experimental Method

In order to evaluate the usefulness of the representations described in Section 5.1, we train a number of small models to try to solve the downstream tasks that are a part of the NOSS benchmark (Section 3). For each representation / task pair, we explore different downstream models, representation aggregation techniques, and normalization methods. We describe the details in the following sections.

5.2.1 Downstream models

To measure the usefulness of the speech representations, we train shallow models using Scikit-Learn (Pedregosa et al., 2011). We experimented with logistic regression, random forest classifiers, and linear discriminant analysis (LDA) as the downstream models. Despite the shallowness of the above models, we achieve results competitive with the best previously-reported results on some of the benchmark tasks.

5.2.2 Feature aggregation over time

We also experiment with different ways of aggregating information across the time dimension. We tried either aggregating the embeddings, or aggregating the predictions of the models across time.

To help models perform well on tasks where the relevant signal is localized in time, we also experiment with two aggregation functions: average and max. Since the predictions are discrete classes, we only aggregate by taking the mode when we are aggregating over predictions. This is akin to computing functionals on low-level descriptors in the traditional speech feature literature (Eyben et al., 2010) or the pooling layers of the neural networks in (Hershey et al., 2016).
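The three aggregation options (average or max over per-window embeddings, mode over per-window predictions) can be sketched as below. The function names are hypothetical, embeddings are represented as lists of equal-length float lists, and the tie-breaking rule for the mode (first label seen among the most frequent) is an assumption of this sketch.

```python
from collections import Counter


def average_pool(frames):
    """Element-wise mean over a sequence of equal-length embeddings."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def max_pool(frames):
    """Element-wise maximum over a sequence of equal-length embeddings."""
    return [max(column) for column in zip(*frames)]


def mode_of_predictions(labels):
    """Most frequent per-window prediction (ties go to the first seen)."""
    return Counter(labels).most_common(1)[0][0]


window_embeddings = [[1.0, 4.0], [3.0, 2.0]]
avg = average_pool(window_embeddings)   # one clip-level embedding
mx = max_pool(window_embeddings)        # one clip-level embedding
clip_label = mode_of_predictions(["happy", "sad", "happy"])
```

Pooling embeddings yields one vector per clip to feed a single classifier, whereas the mode is applied after a classifier has already labeled each window.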

5.2.3 Feature normalization

The representations are all dense representations, but have very different statistics. We experimented with using the raw representation for classification, as well as normalization to the unit-norm and speaker-dependent normalization as in (Vlasenko et al., 2007).
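The two normalization options can be sketched as follows. The unit-norm case is standard; for the speaker-dependent case, this sketch assumes per-speaker standardization of each dimension (subtracting that speaker's mean and dividing by that speaker's standard deviation), which is one common reading and may differ in detail from (Vlasenko et al., 2007). Function names and the `eps` constant are illustrative.

```python
import math
from collections import defaultdict


def unit_norm(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def speaker_normalize(embeddings, speaker_ids, eps=1e-8):
    """Standardize each dimension using statistics of the same speaker only.

    Assumed scheme for this sketch: per-speaker z-scoring.
    """
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(idx)

    out = [None] * len(embeddings)
    for indices in by_speaker.values():
        dims = list(zip(*(embeddings[i] for i in indices)))
        means = [sum(d) / len(d) for d in dims]
        stds = [
            math.sqrt(sum((x - m) ** 2 for x in d) / len(d))
            for d, m in zip(dims, means)
        ]
        for i in indices:
            out[i] = [
                (x - m) / (s + eps)
                for x, m, s in zip(embeddings[i], means, stds)
            ]
    return out


embeddings = [[1.0], [3.0], [10.0], [10.0]]
normalized = speaker_normalize(embeddings, ["a", "a", "b", "b"])
unit = unit_norm([3.0, 4.0])
```

After per-speaker standardization, the large offset between speakers "a" and "b" disappears, so a downstream classifier sees variation within a speaker rather than differences between speakers.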

5.2.4 Data Splits

Some of the downstream tasks have fixed canonical splits. For the inter-speaker tasks on those datasets, we report numbers on the canonical splits (Speech Commands and VoxCeleb). For the other datasets, and for all the intra-speaker tasks, we perform five random train / test splits and report the average. For the downstream tasks that do not predict the speaker ID, we split the data such that samples from a single speaker are either all in the train set or all in the test set. We split the data into 70% in train and 30% in test (approximately, due to keeping speaker samples in the same group). For intra-speaker tasks (Section 3.2), we train and test on one speaker at a time, then average results across splits and across speakers.
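A speaker-disjoint split of this kind can be sketched as below. The function name, the choice to split at the speaker level (which makes the sample-level ratio only approximately 70/30), and the use of a seed per repetition are assumptions for illustration.

```python
import random


def speaker_disjoint_split(speaker_ids, train_fraction=0.7, seed=0):
    """Split sample indices so that no speaker appears in both sets.

    speaker_ids[i] is the speaker of sample i. Roughly `train_fraction`
    of the speakers (not samples) are assigned to the train set.
    """
    speakers = sorted(set(speaker_ids))
    random.Random(seed).shuffle(speakers)
    n_train = round(train_fraction * len(speakers))
    train_speakers = set(speakers[:n_train])
    train_idx = [i for i, s in enumerate(speaker_ids) if s in train_speakers]
    test_idx = [i for i, s in enumerate(speaker_ids) if s not in train_speakers]
    return train_idx, test_idx


# 50 samples from 10 hypothetical speakers, 5 samples each.
ids = ["s%d" % (i % 10) for i in range(50)]
train_idx, test_idx = speaker_disjoint_split(ids)
train_speakers = {ids[i] for i in train_idx}
test_speakers = {ids[i] for i in test_idx}
```

Repeating the call with five different seeds and averaging the resulting accuracies mirrors the five random splits described above.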

5.2.5 Neural network layer

For the representations generated from pre-trained neural networks (TRILL, VGGish, YAMNet), we experimented with both the final output and two intermediate representations. We hypothesized that layers closer to the output would be more suited to tasks that were more similar to the particular training loss, but ultimately chose the representation for each network that performed best on our downstream tasks. For TRILL, we tried the final 512-dimensional embedding layer and the pre-ReLU output of the first 512-depth convolutional layer (subsequently referred to as layer 19 due to TensorFlow convention). We found that layer 19 performed best on our tasks. For VGGish, we tried the final layer, the first fully-connected layer, and the last max-pool layer. The first fully-connected layer performed best. For YAMNet, we test the final pre-logit layer, the fifth depth-separable convolutional layer output (layer 10), and the fourth regular convolutional layer output (layer 7). We found that layer 10 performed best on our downstream tasks.

5.2.6 Model Distillation

To help make the network more useful, we attempted to train a smaller model while keeping similar quality using distillation. Model distillation was performed by training a truncated MobileNet architecture (Howard et al., 2017) to predict the original TRILL network’s layer 19 embeddings. Specifically, we remove the layers beyond the final depth-512 convolutional layer to ensure the output tensor shape matches that of TRILL’s layer 19 (D). In addition, to reduce dimension, we attach to the truncated MobileNet a bottleneck layer of size 2,048, followed by a fully-connected layer back to the original layer 19 size of dimension 12,288, where we apply a mean squared error loss. Not including the additional bottleneck layer, this distillation reduces model parameters and multiplies by factors of 5.6X (9M → 1.6M) and 25X (1.5B → 59M), respectively.
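The reduction factors quoted above can be checked directly from the rounded counts in the text, and the distillation objective is an ordinary mean squared error between student and teacher embeddings. In the sketch below, the variable and function names are illustrative, and the last line assumes a plain dense layer from 2,048 to 12,288 units without biases.

```python
def mse_loss(student_embedding, teacher_embedding):
    """Mean squared error between two equal-length embeddings."""
    n = len(student_embedding)
    return sum(
        (s - t) ** 2 for s, t in zip(student_embedding, teacher_embedding)
    ) / n


# Rounded counts as quoted in the text.
teacher_params, student_params = 9e6, 1.6e6
teacher_mults, student_mults = 1.5e9, 59e6

param_factor = teacher_params / student_params   # about 5.6
mult_factor = teacher_mults / student_mults      # about 25

# Weights of a dense layer mapping the 2,048-unit bottleneck back to the
# 12,288-dimensional layer 19 output (assumed bias-free for this count).
bottleneck_to_output_weights = 2048 * 12288

toy_loss = mse_loss([1.0, 2.0], [1.0, 4.0])
```

Under that assumption the final dense layer alone would hold roughly 25 million weights, which is consistent with the text quoting the reduction factors without the additional bottleneck layer.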

Representation    CREMA-D  SAVEE  Speech Commands
Mel / MFCC        47.7     77.2   73.3
OpenSmile         47.6     67.8   72.2
Random Network    43.4     73.9   69.0
YAMNet top        39.1     64.4   70.3
YAMNet layer 10   50.8     79.2   77.6
VGGish top        43.1     68.7   70.2
VGGish FCN 1      50.5     77.9   73.1
TRILL top         56.8     83.9   72.2
TRILL layer 19    57.0     84.7   74.7
TRILL distilled   56.8     84.7   74.7
Table 3: Intra-speaker task performance

5.2.7 Fine-tuning

In some cases there might be a domain mismatch between the data distribution used to train the embeddings and that of a specific downstream task. Under these circumstances, shallow models trained on top of frozen embeddings do not have enough degrees of freedom to adapt to the target domain distributions. An alternative approach consists of fine-tuning the entire model end-to-end.

In our fine-tuning experiments, we created a model that consists of an encoder followed by a linear head. The encoder receives as input an audio waveform and produces the corresponding embedding. The linear head is a fully connected layer that receives an embedding as input and produces the logits of the output classes. Note that the number of parameters of the linear head depends on the specific combination of representation and downstream task, since it is determined by $d$ and $C$, where $d$ is the dimensionality of the embedding (after flattening the time and frequency dimensions, if needed) and $C$ is the target number of distinct classes for the downstream task. The parameters of the linear head are randomly initialized.

Some of the benchmark tasks have relatively small amounts of data. Therefore, we observed that fine-tuning can easily lead to overfitting in these cases, due to the mismatch between the model capacity and data availability. To prevent overfitting we rely on early stopping. During training, we compute the model accuracy on the validation split, and we keep track of the model checkpoint that achieves the highest accuracy. We then use this checkpoint to compute the accuracy on the test split.
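The checkpoint selection described above amounts to tracking the best validation accuracy seen so far. A minimal sketch, assuming validation results arrive as (training step, accuracy) pairs; the function name and the example numbers are made up for illustration.

```python
def best_checkpoint(validation_history):
    """Return (step, accuracy) of the checkpoint with the highest accuracy.

    validation_history is an iterable of (step, validation_accuracy)
    pairs in training order. On ties the earliest checkpoint is kept.
    """
    best_step, best_accuracy = None, float("-inf")
    for step, accuracy in validation_history:
        if accuracy > best_accuracy:
            best_step, best_accuracy = step, accuracy
    return best_step, best_accuracy


# Hypothetical validation curve that peaks early and then overfits.
history = [(1000, 0.60), (2000, 0.71), (3000, 0.69), (4000, 0.71)]
selected = best_checkpoint(history)
```

Only the selected checkpoint is then evaluated on the test split, so the test data plays no role in choosing when to stop.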

For fine-tuning we use a batch size of 256 and the Adam optimizer with various learning rates. We started from an initial learning rate and gradually lowered it when overfitting happened too quickly. We ended up with separate values for the larger datasets (VoxForge, VoxCeleb and Speech Commands), for the medium-sized dataset (CREMA-D), and for SAVEE and DementiaBank. We train for a total of 100k iterations, but early stopping happens much earlier for the smaller datasets.

We also experimented with intra-speaker fine-tuning. For this we used the CREMA-D dataset, with fixed train/dev/test splits of 80%/10%/10%. We first fine-tuned a common model on the training partitions of all the speakers, then further fine-tuned on each speaker separately. We used 100k iterations for the global model, then 10k iterations with a separately chosen learning rate for the per-speaker models, due to the limited amount of data.

6 Results

Figure 1: Effect of model on accuracy. A linear regression on the observed accuracies, with both the model and task as the explanatory variables. The effect a model has on the accuracy is the coefficient associated with the model in the regression. For a given task, when changing from one model to another, the resulting change in accuracy is expected to be the difference between the two models’ values in this figure.

TRILL outperforms previously reported results on three of the six benchmark tasks, and is competitive with previous best on two of the three remaining tasks. On these two tasks, the best previously reported numbers use other modalities in addition to audio (visual or textual features) (Table 2). Of the representations we compared against, TRILL performed the best on five-of-six tasks (Table 2) and two-of-three intra-speaker tasks (Table 3). We successfully distilled TRILL to a much smaller model that can be trained and run inference on a mobile device. The distilled model has no performance degradation on five-of-nine tasks, statistically insignificant degradation on one, and minor degradation on the remaining tasks.

We fit logistic regression, random forests, or LDA to various representations and compared their performances on our benchmark tasks, as described in Section 5.2. On log-magnitude mel spectrogram and MFCC features, we found that the average performance of random forest models was significantly better than the others, and for the randomly-initialized networks LDA performed best. In all other cases logistic regression either matched or beat the alternatives.

Normalization per speaker, on the datasets where it is applicable, increases average performance by more than one standard deviation of the mean difference, for all datasets and all model types, so the numbers reported in Table 2 are with speaker normalization.

We compared aggregating predictions in different ways to help downstream models take advantage of signals over time in different ways. For the datasets in our non-semantic benchmark, we found that pooling representations either produced the best point estimate, or was within one standard deviation of the best result (variance produced by training and testing on five different splits of the data). We also compared two strategies of pooling the representations over time, average or max. In all of the datasets we found that either average pooling performs better or there is no difference between the two.

6.1 Aggregating Scores Across Tasks

Comparing models across different tasks is an ill-defined problem in general, since different tasks have different statistical properties, different numbers of classes, and so on, so taking the mean precision across the tasks is not always a good proxy for analyzing which model performs best. We overcome this by fitting a linear regression on the observed accuracies, with both the model and the task as explanatory variables. The observed accuracies are explained quite well by the linear regression, achieving an . The coefficients of the different models in the regression are estimates of the effect each model has on the accuracy regardless of the choice of task. The full results are shown in Figure 1.
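To make the idea concrete, here is a minimal sketch of an additive model, accuracy ≈ baseline + model effect + task effect. It assumes every model has been evaluated on every task (a complete, balanced grid); under that assumption, and with effects constrained to sum to zero, the least-squares solution reduces to differences of row and column means, so no linear algebra library is needed. The function name and the coefficient of determination computed at the end are illustrative choices, not a description of the exact regression used for Figure 1.

```python
def fit_additive_effects(acc):
    """Least-squares fit of acc[model, task] ~ grand + model_effect + task_effect.

    acc: dict mapping (model, task) -> accuracy. Assumes a complete grid,
    i.e. every model appears with every task exactly once.
    """
    models = sorted({m for m, _ in acc})
    tasks = sorted({t for _, t in acc})
    grand = sum(acc.values()) / len(acc)

    # For a balanced design, each effect is a marginal mean minus the grand mean.
    model_effect = {
        m: sum(acc[m, t] for t in tasks) / len(tasks) - grand for m in models
    }
    task_effect = {
        t: sum(acc[m, t] for m in models) / len(models) - grand for t in tasks
    }

    # Coefficient of determination: share of variance explained by the fit.
    ss_total = sum((v - grand) ** 2 for v in acc.values())
    ss_residual = sum(
        (acc[m, t] - (grand + model_effect[m] + task_effect[t])) ** 2
        for m in models
        for t in tasks
    )
    r_squared = 1.0 - ss_residual / ss_total
    return grand, model_effect, task_effect, r_squared
```

The model effects can then be ranked directly: a larger effect means the model tends to score higher after the difficulty of each task has been accounted for.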

6.2 Model distillation

The results of the distilled model are presented in the next-to-last row of Table 2. With the exception of language identification on VoxForge and speech emotion recognition on SAVEE, the distilled model, despite its reduced capacity and dimensionality, performs within the standard deviation of the larger embedding (variance calculated over 5-fold data splits). On the personalized tasks, the quality of the distilled model is the same as that of the larger model.
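The "within the standard deviation" criterion can be checked as follows. This is only a sketch: it assumes the comparison is between the mean fold accuracy of the distilled model and that of the larger model, and that the spread is the sample standard deviation of the larger model's folds. The fold values in the usage example are made up for illustration and are not results from Table 2.

```python
from statistics import mean, stdev


def within_one_std(reference_folds, candidate_folds):
    """True if the candidate's mean lies within one (sample) standard
    deviation of the reference model's mean across folds."""
    reference_mean = mean(reference_folds)
    reference_std = stdev(reference_folds)
    return abs(mean(candidate_folds) - reference_mean) <= reference_std


# Hypothetical 5-fold accuracies, for illustration only.
larger_model = [70.0, 72.0, 74.0, 71.0, 73.0]
distilled_model = [71.0, 72.0, 72.0, 71.0, 73.0]
no_significant_drop = within_one_std(larger_model, distilled_model)
```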

7 Analysis

7.1 Effectiveness of fine-tuning a strong embedding

As can be seen in Table 2, fine-tuning the final embedding gives a clear boost on most tasks and sets a new state-of-the-art in 3 out of the 6 datasets. This approach of learning a strong representation on a large benchmark and then fine-tuning it for the downstream task has been proven very effective in other domains (Zhai et al., 2019), and in this paper we demonstrate the same is true in the speech domain.

7.2 Using an intermediate layer

An important observation of our research is that the most effective representation learned by both YAMNet and the triplet network is not at the final layer, but in one of the intermediate layers. This intermediate layer, then, must capture information that is later discarded in the final layers. One possible explanation is that, when trained with the triplet loss, the network learns to discard properties that vary temporally, such as tone of voice or semantic information, while this information is still captured in the intermediate layer. Further evidence for this can be seen in the results on the Speech Commands dataset, where the performance of the intermediate layers of all of the learned representations is much better than the performance of their respective top layers.

7.3 The effect of time resolution

To further test our hypothesis on sensitivity to time resolution, we tested our representation on phoneme recognition from TIMIT (Lopes and Perdigao, 2011). This task operates on a smaller time scale than the other tasks in the NOSS benchmark. The TRILL embedding achieved an accuracy of 71.7, significantly worse than the best published result of 83.1 (Michalek and Vaněk, 2018).

7.4 Intra-speaker fine-tuning

TRILL frozen 57.0
TRILL global fine-tuning 69.6
TRILL per-speaker fine-tuning 73.2
Table 4: Intra-speaker fine-tuning

Table 4 shows that per-speaker fine-tuning further personalizes the model to each speaker and generally improves accuracy. Breaking down the impact by speaker, per-speaker fine-tuning improves accuracy for 31 speakers, leaves it mostly unchanged for 49 speakers, and decreases it for 12 speakers.
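A per-speaker breakdown of this kind can be computed with a simple tally. The sketch below assumes per-speaker accuracies are available before and after per-speaker fine-tuning, and it treats any change smaller than a `tolerance` as "mostly unchanged". The tolerance value and the example accuracies are hypothetical; the threshold used for the counts above is not stated.

```python
def breakdown_by_speaker(before, after, tolerance=0.5):
    """Count speakers whose accuracy improved, stayed roughly the same,
    or decreased. `tolerance` (in accuracy points) is an assumed threshold."""
    counts = {"improved": 0, "unchanged": 0, "decreased": 0}
    for speaker, old_accuracy in before.items():
        delta = after[speaker] - old_accuracy
        if delta > tolerance:
            counts["improved"] += 1
        elif delta < -tolerance:
            counts["decreased"] += 1
        else:
            counts["unchanged"] += 1
    return counts


# Made-up accuracies for three speakers, for illustration only.
before = {"s1": 50.0, "s2": 60.0, "s3": 70.0}
after = {"s1": 55.0, "s2": 60.25, "s3": 66.0}
summary = breakdown_by_speaker(before, after)
```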

7.5 Performance of randomly initialized networks

As noted in Section 5.1.3, randomly initialized networks are known to generate good priors for classification. Our experiments add further evidence for this, showing that on all of the tasks in our benchmark they give significant results, and that in some cases the results match those of the mel spectrogram, a very well-known audio feature.

8 Conclusions

In this work, we explore the importance of clearly defining benchmarks when comparing representations of speech. We propose NOSS (Section 3) as a collection of publicly available tasks to help the research community focus on non-semantic representations, and we introduce a sub-category of personalization tasks to help the research community measure progress in the age of on-device computation. We also demonstrate that TRILL (Section 4), based on a self-supervised training criterion, simultaneously performs well on all tasks in the benchmark. We show that fine-tuning TRILL on a small amount of data outperforms or is competitive with almost all previously reported numbers for the NOSS tasks, and that TRILL is significantly better than other representations. Finally, we distill TRILL to be compact and on-device with very little or no performance loss. TRILL and the distilled version will both be made publicly available.

References


  • R. Arandjelovic and A. Zisserman (2017) Look, listen and learn. CoRR abs/1705.08168. External Links: Link, 1705.08168 Cited by: §2.
  • F. Boller and J. Becker (2005) Dementiabank database guide. University of Pittsburgh. Cited by: §3.1.6.
  • H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma (2014) CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing 5 (4), pp. 377–390. Cited by: §3.1.4.
  • V. Cheplygina, M. de Bruijne, and J. P. Pluim (2019) Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Medical image analysis 54, pp. 280–296. Cited by: §2.
  • Y. Chung and J. Glass (2020) Generative pre-training for speech with autoregressive predictive coding. In ICASSP, Cited by: §2.
  • Y. Chung, W. Hsu, H. Tang, and J. Glass (2019) An unsupervised autoregressive model for speech representation learning. Interspeech 2019. External Links: Link, Document Cited by: §2.
  • Y. Cui, Y. Song, C. Sun, A. Howard, and S. J. Belongie (2018) Large scale fine-grained categorization and domain-specific transfer learning. CoRR abs/1806.06193. External Links: Link, 1806.06193 Cited by: §2.
  • N. Cummins and B. W. Schuller (2019) Latest advances in computational speech analysis for mobile sensing. In Digital Phenotyping and Mobile Sensing, pp. 141–159. Cited by: §5.1.2.
  • S. Davis and P. Mermelstein (1980) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE transactions on acoustics, speech, and signal processing 28 (4), pp. 357–366. Cited by: §5.1.1.
  • J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell (2013) DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR abs/1310.1531. External Links: Link, 1310.1531 Cited by: §2.
  • F. Eyben, M. Wöllmer, and B. Schuller (2010) OpenSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia, pp. 1459–1462. Cited by: §1, §5.1.2, §5.2.2.
  • M. Freitag, S. Amiriparian, S. Pugachevskiy, N. Cummins, and B. W. Schuller (2017) AuDeep: unsupervised learning of representations from audio with deep recurrent neural networks. CoRR abs/1712.04382. External Links: Link, 1712.04382 Cited by: §2.
  • J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. Cited by: §1, §2, §4.
  • E. Ghaleb, M. Popa, and S. Asteriadis (2019) Multimodal and temporal perception of audio-visual cues for emotion recognition. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 552–558. Cited by: §3.1.4, Table 2.
  • S. Haq, P. J. Jackson, and J. Edge (2009) Speaker-dependent audio-visual emotion recognition.. In AVSP, pp. 53–58. Cited by: §3.1.5, Table 2.
  • S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. W. Wilson (2016) CNN architectures for large-scale audio classification. CoRR abs/1609.09430. External Links: Link, 1609.09430 Cited by: §2, §4, §5.1.4, §5.2.2.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861. External Links: Link, 1704.04861 Cited by: §5.2.6.
  • S. Ioffe (2017) Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In Advances in neural information processing systems, pp. 1945–1953. Cited by: §4.
  • A. Jansen, M. Plakal, R. Pandya, D. P. W. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous (2017) Unsupervised learning of semantic audio representations. CoRR abs/1711.02209. External Links: Link, 1711.02209 Cited by: §1, §2, §4, §4.
  • [20] Kaggle TensorFlow speech recognition challenge. Note: 2020-01-28 Cited by: Table 2.
  • Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley (2019) PANNs: large-scale pretrained audio neural networks for audio pattern recognition. arXiv preprint arXiv:1912.10211. Cited by: §2, §2, §2, §5.1.4.
  • S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Qadir, and B. W. Schuller (2020) Deep representation learning in speech processing: challenges, recent advances, and future trends. External Links: 2001.00378 Cited by: §1, §5.1.2.
  • S. Latif, R. Rana, J. Qadir, and J. Epps (2018) Variational autoencoders for learning latent representations of speech emotion: a preliminary study. Interspeech 2018: Proceedings, pp. 3107–3111. Cited by: §2.
  • C. Lopes and F. Perdigao (2011) Phone recognition on the timit database. Speech Technologies/Book 1, pp. 285–302. Cited by: §7.3.
  • K. MacLean (2018) Voxforge. Ken MacLean. [Online]. Available: [Accessed 2012]. Cited by: §3.1.2.
  • J. Michalek and J. Vaněk (2018) A survey of recent dnn architectures on the timit phone recognition task. In International Conference on Text, Speech, and Dialogue, pp. 436–444. Cited by: §7.3.
  • M. Michelashvili and L. Wolf (2019) Audio denoising with deep network priors. External Links: 1904.07612 Cited by: §5.1.3.
  • A. Nagrani, J. S. Chung, and A. Zisserman (2017) VoxCeleb: a large-scale speaker identification dataset. Interspeech 2017. External Links: Link, Document Cited by: §3.1.1, Table 2.
  • Z. Noorian, C. Pou-Prom, and F. Rudzicz (2017) On the importance of normative data in speech-based assessment. CoRR abs/1712.00069. External Links: Link, 1712.00069 Cited by: Table 2.
  • S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §2.
  • S. Parthasarathy and C. Busso (2018) Ladder networks for emotion recognition: using unsupervised auxiliary tasks to improve predictions of emotional attributes. Interspeech 2018. External Links: Link, Document Cited by: §2.
  • S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio (2019) Learning Problem-Agnostic Speech Representations from Multiple Self-Supervised Tasks. In Proc. Interspeech 2019, pp. 161–165. External Links: Document, Link Cited by: §2, §2.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §5.2.1.
  • M. Plakal and D. Ellis (2020) YamNet. External Links: Link Cited by: §5.1.4.
  • S. Revay and M. Teschke (2019) Multiclass language identification using deep learning on spectral images of audio signals. CoRR abs/1905.04348. External Links: Link, 1905.04348 Cited by: §3.1.2, Table 2.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §4.
  • B. Schuller, S. Steidl, A. Batliner, J. Hirschberg, J. Burgoon, A. Baird, A. Elkins, Y. Zhang, E. Coutinho, and K. Evanini (2016) The INTERSPEECH 2016 computational paralinguistics challenge: deception, sincerity and native language. pp. 2001–2005. External Links: Document Cited by: §5.1.2.
  • C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu (2018) A survey on deep transfer learning. In International conference on artificial neural networks, pp. 270–279. Cited by: §2.
  • Y. Tian, C. Xu, and D. Li (2019) Deep audio prior. External Links: 1912.10292 Cited by: §5.1.3.
  • D. Ulyanov, A. Vedaldi, and V. Lempitsky (2018) Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454. Cited by: §5.1.3.
  • A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. CoRR abs/1807.03748. External Links: Link, 1807.03748 Cited by: §2.
  • B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll (2007) Combining frame and turn-level information for robust recognition of emotions within speech. pp. 2249–2252. Cited by: §5.2.3.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) SuperGLUE: A stickier benchmark for general-purpose language understanding systems. CoRR abs/1905.00537. External Links: Link, 1905.00537 Cited by: §1, §2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR abs/1804.07461. External Links: Link, 1804.07461 Cited by: §1.
  • P. Warden (2018) Speech commands: a dataset for limited-vocabulary speech recognition. External Links: 1804.03209 Cited by: §3.1.3.
  • J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3320–3328. External Links: Link Cited by: §2.
  • X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, C. Riquelme, M. Lucic, J. Djolonga, A. S. Pinto, M. Neumann, A. Dosovitskiy, L. Beyer, O. Bachem, M. Tschannen, M. Michalski, O. Bousquet, S. Gelly, and N. Houlsby (2019) The visual task adaptation benchmark. External Links: 1910.04867 Cited by: §1, §1, §2, §2, §7.1.
  • Z. Zhang, Y. Wang, C. Gan, J. Wu, J. B. Tenenbaum, A. Torralba, and W. T. Freeman (2020) Deep audio priors emerge from harmonic convolutional networks. In International Conference on Learning Representations, External Links: Link Cited by: §5.1.3.
  • Z. Zhang, B. Wu, and B. Schuller (2019) Attention-augmented end-to-end multi-task learning for emotion prediction from speech. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6705–6709. Cited by: §2.