Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends

01/02/2020 · by Siddique Latif, et al. · University of Southern Queensland

Research on speech processing has traditionally considered the task of designing hand-engineered acoustic features (feature engineering) as a problem separate from that of designing efficient machine learning (ML) models for prediction and classification. There are two main drawbacks to this approach: firstly, manual feature engineering is cumbersome and requires human knowledge; and secondly, the designed features might not be optimal for the objective at hand. This has motivated a recent trend in the speech community towards representation learning techniques, which can automatically learn an intermediate representation of the input signal that better suits the task at hand and hence leads to improved performance. The significance of representation learning has increased with advances in deep learning (DL), where the representations are more useful and less dependent on human knowledge, making them conducive to tasks like classification and prediction. The main contribution of this paper is an up-to-date and comprehensive survey of speech representation learning techniques that brings together scattered research across three distinct areas: Automatic Speech Recognition (ASR), Speaker Recognition (SR), and Speech Emotion Recognition (SER). Recent reviews have covered ASR, SR, and SER individually; however, none of these has focused on representation learning from speech, a gap that our survey aims to bridge.




I Introduction

The performance of machine learning (ML) algorithms heavily depends on data representation, or features. Traditionally, much of the actual effort in ML research has gone into feature engineering, that is, the design of pre-processing and data transformation pipelines that craft representations supporting ML algorithms [1]. Although such feature engineering can help improve the performance of predictive models, the downside is that it is labour-intensive and time-consuming. To broaden the scope of ML algorithms, it is desirable to make learning algorithms less dependent on hand-crafted features.

A key application of ML algorithms has been in analysing and processing speech. Nowadays, speech interfaces have become widely accepted and integrated into various real-life applications and devices. Services like Siri and Google Voice Search have become a part of our daily life and are used by millions of users [2]. Research in speech processing and analysis has always been motivated by a desire to enable machines to participate in verbal human-machine interactions. The research goals of enabling machines to understand human speech, identify speakers, and detect human emotion have attracted researchers' attention for more than sixty years [3]. Researchers are now focusing on transforming current speech-based systems into next-generation AI devices that interact with humans in a friendlier manner and provide personalised responses according to their mental states. In all these successes, speech representations, in particular deep learning (DL)-based speech representations, play an important role. Representation learning, broadly speaking, is the technique of learning representations of input data, usually through transformations of the input, where the key goal is to yield abstract and useful representations for tasks like classification and prediction. One major reason for adopting representation learning techniques in speech technology is that speech data is fundamentally different from two-dimensional image data: images can be analysed as a whole or in patches, but speech has to be processed sequentially to capture temporal context.

Traditionally, the efficiency of ML algorithms on speech has relied heavily on the quality of hand-crafted features. A good feature set often leads to better performance than a poor one. Therefore, feature engineering, which focuses on creating features from raw speech, has been an important field of research for a long time and has produced a large number of studies. DL models, in contrast, can learn feature representations automatically, which minimises the dependency on hand-engineered features and thereby gives better performance in different speech applications [4]. These deep models can be trained on speech data in different ways, such as supervised, unsupervised, semi-supervised, transfer, and reinforcement learning. This survey covers all these feature learning techniques and popular deep learning models in the context of three popular speech applications [5]: (1) automatic speech recognition (ASR); (2) speaker recognition (SR); and (3) speech emotion recognition (SER).

Bengio et al. [1]: Reviewed the work in the area of unsupervised feature learning and deep learning, also covering advancements in probabilistic models and autoencoders. Does not include recent models like VAEs and GANs.

Zhong et al.: Reviews the history of data representation learning from traditional to recent DL methods. Challenges for representation learning, recent advancements, and future trends are not covered.

Zhang et al. [7]: Provides a systematic overview of representative DL approaches designed for environmentally robust ASR.

Swain et al. [8]: Reviewed the literature on various databases, features, and classifiers for SER systems.

Nassif et al. [9]: Presented a systematic review of studies from 2006 to 2018 on DL-based speech recognition and highlighted the trends of research in ASR.

Our paper: Covers different representation learning techniques for speech and DL models, discusses different challenges, and highlights recent advancements and future trends. The main contribution is to bring together scattered research on representation learning of speech across three research areas: ASR, SR, and SER.

TABLE I: Comparison of our paper with existing surveys.
Fig. 1: Organisation of the paper.

Despite growing interest in representation learning from speech, existing contributions are scattered across different research areas and a comprehensive survey is missing. To highlight this, we present a summary of popular and recently published review papers in Table I. The review article published in 2013 by Bengio et al. [1] is one of the most cited papers. It is very generic and focuses on appropriate objectives for learning good representations, on computing representations (i. e., inference), and on the geometrical connections between representation learning, manifold learning, and density estimation. Due to its earlier publication date, it focused on principal component analysis (PCA), restricted Boltzmann machines (RBMs), and autoencoders (AEs); recently proposed generative models were out of its scope. The research on representation learning has evolved significantly since then, as generative models like variational autoencoders (VAEs) [10], generative adversarial networks (GANs) [11], etc., have been found to be more suitable for representation learning than autoencoders and other classical methods. We cover all these new models in our review. Although other recent surveys have focused on DL techniques for ASR [9, 7], SR [12], and SER [8], none of these has focused on representation learning from speech. This article bridges this gap by presenting an up-to-date survey of research on representation learning in three active areas: ASR, SR, and SER. Beyond reviewing the literature, we discuss the applications of deep representation learning, present popular DL models and their representation learning abilities, and describe the different representation learning techniques used in the literature. We further highlight the challenges faced by deep representation learning in speech and finally conclude by discussing recent advancements and pointing out future trends. The structure of this article is shown in Figure 1.

II Background

II-A Traditional Feature Learning Algorithms

In the field of data representation learning, algorithms are generally categorised into two classes: shallow learning algorithms and DL-based models [13]. Shallow learning algorithms, also considered traditional methods, aim to learn transformations of data by extracting useful information. One of the oldest feature learning algorithms, Principal Component Analysis (PCA) [14], has been studied extensively over the last century. Since then, various other shallow learning algorithms have been proposed based on different learning techniques and criteria, until the rise of deep models in recent years. Similar to PCA, Linear Discriminant Analysis (LDA) [15] is another shallow learning algorithm. Both PCA and LDA are linear data transformation techniques; however, LDA is a supervised method that requires class labels to maximise class separability. Other linear feature learning methods include Canonical Correlation Analysis (CCA) [16], Multi-Dimensional Scaling (MDS) [17], and Independent Component Analysis (ICA) [18]. Kernel versions of some linear feature mapping algorithms have also been proposed, including kernel PCA (KPCA) [19] and generalised discriminant analysis (GDA) [20], which are non-linear versions of PCA and LDA, respectively. Another popular technique is Non-negative Matrix Factorisation (NMF) [21], which can generate sparse representations of data useful for ML tasks.

Many methods for non-linear dimensionality reduction have also been proposed to discover the non-linear hidden structure of high-dimensional data [22]. They include Locally Linear Embedding (LLE) [23], Isometric Feature Mapping (Isomap) [24], T-distributed Stochastic Neighbour Embedding (t-SNE) [25], and Neural Networks (NNs) [26]. In contrast to kernel-based methods, non-linear feature representation algorithms directly learn the mapping functions by preserving the local information of the data in the low-dimensional space. Traditional representation algorithms have been widely used by researchers in the speech community for transforming speech representations into more informative, low-dimensional features (e. g., [27, 28]). These shallow feature learning algorithms dominated the data representation learning area until the successful training of deep models for representation learning by Hinton and Salakhutdinov in 2006 [29]. This work was quickly followed up with similar ideas by others [30, 31], which led to a large number of deep models suitable for representation learning. We discuss the brief history of the success of DL in speech technology next.

II-B Brief History of Deep Learning (DL) in Speech Technology

For decades, Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM) based systems (GMM-HMM) ruled speech technology due to their many advantages, including their mathematical elegance and their capability to model time-varying sequences [32]. Around 1990, discriminative training was found to produce better results than models trained using maximum likelihood [33]. Since then, researchers started working towards replacing GMMs with feature learning models, including neural networks (NNs), restricted Boltzmann machines (RBMs), deep belief networks (DBNs), and deep neural networks (DNNs) [34]. Such hybrid models gained popularity while HMMs continued to be investigated.

In the meanwhile, researchers also worked towards replacing the HMM with other alternatives. In 2012, DNNs were trained on thousands of hours of speech data and successfully reduced the word error rate (WER) on ASR tasks [35], thanks to their ability to learn a hierarchy of representations from input data. Soon after, however, recurrent neural network (RNN) architectures, including long short-term memory (LSTM) and gated recurrent units (GRUs), outperformed DNNs and became state-of-the-art models not only in ASR [36] but also in SER [37]. The superior performance of RNN architectures stems from their ability to capture temporal context from speech [38, 39]. Later, a cascade of convolutional neural network (CNN), LSTM, and fully connected (DNN) layers was shown to outperform LSTM-only models by capturing more discriminative attributes from speech [40, 41]. The lack of labelled data set the pace for research on unsupervised representation learning. For unsupervised representation learning from speech, AEs, RBMs, and DBNs were widely used [42].

Nowadays, there is significant interest in three classes of generative models: VAEs, GANs, and deep auto-regressive models [43, 44]. They are widely employed for speech processing; especially VAEs and GANs are becoming very influential models for learning speech representations in an unsupervised way [45, 46]. In speech analysis tasks, deep models for representation learning can either be applied to speech features or directly to the raw waveform. We present a brief history of speech features in the next section.

II-C Speech Features

In speech processing, feature engineering and the design of models for classification or prediction are often considered separate problems. Feature engineering is a way of manually designing speech features by taking advantage of human ingenuity. For decades, Mel Frequency Cepstral Coefficients (MFCCs) [47] have been used as the principal feature set for speech analysis tasks. Four steps are involved in MFCC extraction: computing the Fourier transform, projecting the powers of the spectrum onto the Mel scale, taking the logarithm of the Mel frequencies, and applying the Discrete Cosine Transform (DCT) for a compressed representation. It has been found that the last step removes information and destroys spatial relations; it is therefore often omitted, which yields the log-Mel spectrum, a popular feature across the speech community and the most widely used input for training DL networks.
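To illustrate the pipeline, the following is a minimal numpy sketch of the first three steps (framing and Fourier transform, Mel projection, logarithm); skipping the final DCT yields the log-Mel spectrum. Function names and parameter values (16 kHz audio, 40 Mel bands) are our own illustrative assumptions:

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the Mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def log_mel(signal, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Frame + window + power spectrum, Mel projection, then logarithm."""
    n = 1 + max(0, len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n)])
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
M = log_mel(x)                                          # frames x mel bands
```

Applying the DCT to the rows of `M` (and keeping the first dozen or so coefficients) would complete the fourth step and yield MFCCs.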

The Mel filter bank is inspired by auditory and physiological findings on how humans perceive speech signals [48]. Sometimes it is preferable to use features that capture transpositions as translations. For this, a suitable representation is the spectrogram, which captures how the frequency content of the speech signal changes over time [49]. In speech research, CNNs are widely used on spectrogram inputs due to their image-like structure. The log-Mel spectrogram is another speech representation; it is more compact and has become the current state of the art, because models using these features usually need less data and training to achieve similar or better results.

In SER, feature engineering is more dominant, and minimalistic feature sets like GeMAPS and eGeMAPS [50] have been proposed based on affective physiological changes in voice production and their theoretical significance [50]. They are also popular as benchmark feature sets. However, in speech analysis tasks, some works [51, 52] show that the particular choice of features is less important than the design of the model architecture and the amount of training data. Research continues into designing DL models and input features that involve minimal human knowledge.

II-D Databases

Although the success of deep learning is usually attributed to model capacity and higher computational power, the most crucial role is played by the availability of large-scale labelled datasets [53]. In contrast to the vision domain, the speech community started using DNNs with considerably smaller datasets. Some popular conventional corpora used for ASR and SR include TIMIT [54], Switchboard [55], WSJ [56], and AMI [57]. Similarly, EMO-DB [58], FAU-AIBO [59], RECOLA [60], and GEMEP [61] are popular classical datasets. Recently, larger datasets have been created and released to the research community to engage industry as well as researchers. We summarise some of these recent, publicly available datasets that are widely used in the speech community in Table II.

Speech and Speaker Recognition:

- LibriSpeech [62]: 1 000 hours of speech from 2 484 speakers. Designed for speech recognition and also used for speaker identification and verification.
- VoxCeleb2 [63]: 1 128 246 utterances from 6 112 celebrities. Extracted from videos uploaded to YouTube; designed for speaker identification and verification.
- 118 hours of speech from 698 speakers; extracted from 818 TED Talks for ASR.
- THCHS-30 [65]: 30 hours of speech from 30 speakers; recorded for Chinese speech recognition.
- AISHELL-1 [66]: 170 hours of speech from 400 speakers; an open-source Mandarin ASR corpus.
- Tuda-De [67]: 127 hours of speech from 147 speakers; a publicly released corpus of German utterances for distant speech recognition.

Speech Emotion Recognition:

- EMO-DB [58]: 10 actors, 494 utterances. An acted corpus of 10 German sentences of the kind usually used in everyday communication.
- 150 participants, 959 conversations. An induced corpus recorded to build sensitive artificial listener agents that can engage a person in a sustained, emotionally coloured conversation.
- 12 hours of speech from 10 speakers. Collected in an interactive setting that served to elicit authentic emotions and create a larger emotional corpus to study multimodal interactions.
- 9 hours of audiovisual data from 12 actors. Recorded from dyadic interactions of actors to study emotions.
TABLE II: Speech corpora and their details.

II-E Evaluations

Evaluation measures vary across speech tasks. The performance of ASR systems is usually measured using the word error rate (WER), i. e., the sum of insertions, deletions, and substitutions divided by the total number of words in the reference transcription. Speaker verification systems consider two types of errors: false rejections (fr), where a valid identity is rejected, and false acceptances (fa), where a fake identity is accepted. These two errors are measured experimentally on test data. Based on them, a detection error trade-off (DET) curve is drawn to evaluate the performance of the system: the DET curve plots the probability of false acceptance (P_fa) as a function of the probability of false rejection (P_fr). Another popular evaluation measure is the equal error rate (EER), which corresponds to the operating point where P_fa = P_fr. Similarly, the area under the curve (AUC) of the receiver operating characteristic (ROC) is often reported. Details on other evaluation measures for the speaker verification task can be found in [71]. Both speaker identification and emotion recognition use classification accuracy as a metric. However, as data is often imbalanced across classes in naturalistic emotion corpora, accuracy is usually reported as so-called unweighted accuracy (UA) or unweighted average recall (UAR), which represents the average recall across classes, unweighted by the number of instances per class. This was introduced by the first challenge in the field, the Interspeech 2009 Emotion Challenge [72], and has since been picked up by other challenges across the field. SER systems also use regression to predict continuous emotional attributes such as arousal, valence, or dominance.
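The WER and EER measures can be sketched as follows; this is a toy implementation with our own function names, and the EER is approximated on a finite score set rather than on an interpolated DET curve:

```python
import numpy as np

def wer(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference words,
    computed with the classic edit-distance dynamic programme."""
    r, h = reference.split(), hypothesis.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + sub)  # substitution / match
    return d[-1, -1] / len(r)

def eer(genuine, impostor):
    """Approximate EER: sweep thresholds, find where P_fr is closest to P_fa."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    p_fr = np.array([(genuine < t).mean() for t in thresholds])    # false rejection
    p_fa = np.array([(impostor >= t).mean() for t in thresholds])  # false acceptance
    i = np.argmin(np.abs(p_fr - p_fa))
    return (p_fr[i] + p_fa[i]) / 2.0
```

In practice, toolkits interpolate between operating points for the EER; the nearest-crossing approximation above is only adequate for large score sets.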

III Applications of Deep Representation Learning

Learning representations is a fundamental problem in AI and it aims to capture useful information or attributes of data, where deep representation learning involves DL models for this task. Various applications of deep representation learning have been summarised in Figure 2.

Fig. 2: Applications of deep representation learning.

III-A Automatic Feature Learning

Automatic feature learning is the process of constructing explanatory variables or features that can be used for classification or prediction problems. Feature learning algorithms can be supervised or unsupervised [73]. Deep learning (DL) models are composed of multiple hidden layers, and each layer provides a kind of representation of the given data [74]. It has been found that automatically learnt feature representations are, given enough training data, usually more efficient and repeatable than hand-crafted or manually designed features, which allows building better and faster predictive models [6]. Most importantly, automatically learnt feature representations are in most cases more flexible and powerful, and can be applied to data science problems in the fields of vision processing [75], text processing [76], and speech processing [77].

III-B Dimension Reduction and Information Retrieval

Broadly speaking, dimensionality reduction methods are commonly used for two purposes: (1) to eliminate data redundancy and irrelevancy for higher efficiency and often increased performance, and (2) to make the data more understandable and interpretable by reducing the number of input variables [6]. In some applications, it is very difficult to analyse high-dimensional data with a limited number of training samples [78]. Therefore, dimension reduction becomes imperative to retrieve the important variables or information relevant to the specified problem. It has been validated that using more interpretable features in a lower dimension can provide competitive or even better performance when designing predictive models [79].

Information retrieval is the process of finding information based on a user query by examining a collection of data [80]. The queried material can be text, documents, images, or audio, and users can express their queries in the form of text, voice, or an image [81, 82]. Finding a suitable representation of a query to perform retrieval is a challenging task, and DL-based representation learning techniques play an important role in this field. The major advantage of using representation learning models for information retrieval is that they can learn features automatically with little or no prior knowledge [83].

III-C Data Denoising

Despite the success of deep models in different fields, these models remain brittle to noise [84]. To deal with noisy conditions, one often performs data augmentation by adding artificially noised examples to the training set [85]. However, data augmentation may not always help, because the distribution of the noise is not always known. In contrast, representation learning methods can be effectively utilised to learn noise-robust features, and they often provide better results than data augmentation [86]. In addition, speech can be denoised, for instance by DL-based speech enhancement systems [87].

III-D Clustering Structure

Clustering is one of the most traditional and frequently used data representation methods. It aims to categorise similar data samples into one cluster using similarity measures (e. g., Euclidean distance). A large number of data clustering techniques have been proposed [88]. Classical clustering methods usually show poor performance on high-dimensional data and suffer from high computational complexity on large-scale datasets [89]. In contrast, DL-based clustering methods can process large and high-dimensional data (e. g., images, text, speech) with a reasonable time complexity, and they have emerged as effective tools for discovering clustering structure [89].
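As a concrete instance of the classical Euclidean-distance methods mentioned above, plain k-means can be sketched in numpy; the deterministic farthest-point initialisation is our own simplification for a reproducible toy example:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means with Euclidean distance and a deterministic
    greedy farthest-point initialisation."""
    centroids = [X[0]]
    while len(centroids) < k:
        d = ((X[:, None, :] - np.array(centroids)[None]) ** 2).sum(-1).min(axis=1)
        centroids.append(X[np.argmax(d)])       # farthest point from current centroids
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        # assign each sample to its nearest centroid, then recompute cluster means
        labels = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),   # cluster near (0, 0)
               rng.normal(3.0, 0.3, (50, 2))])  # cluster near (3, 3)
labels, centroids = kmeans(X, 2)
```

The pairwise-distance broadcasting used here is exactly what becomes prohibitive on large, high-dimensional data, motivating the DL-based alternatives discussed in the text.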

III-E Disentanglement and Manifold Learning

Disentangled representation learning represents each factor of variation in the data as a separate, narrowly defined variable, encoded in its own dimension [1]. It differs from other feature extraction or dimensionality reduction techniques in that it explicitly aims to learn representations whose axes align with the generative factors of the input data [90, 91]. In practice, data is generated from independent factors of variation, and disentangled representation learning aims to capture these factors with different independent variables in the representation. In this way, the latent variables are interpretable, generalisable, and robust against adversarial attacks [92].

Manifold learning aims to describe data as low-dimensional manifolds embedded in high-dimensional spaces [93]. It can retain a meaningful structure in very low dimensions compared to linear dimension reduction methods [94]. Manifold learning algorithms attempt to describe high-dimensional data as a non-linear function of fewer underlying parameters by preserving its intrinsic geometry [95, 96]. Such parameters have widespread applications in pattern recognition, speech analysis, and computer vision.


III-F Abstraction and Invariance

The architecture of DNNs is inspired by the hierarchical structure of the brain [98]. It is anticipated that deep architectures might capture abstract representations [99]. Learning abstractions is equivalent to discovering a universal model that can be used across all tasks to facilitate generalisation and knowledge transfer. More abstract features are generally invariant to local changes and are non-linear functions of the raw input [100]. Abstract representations also capture high-level continuous-valued attributes that are sensitive only to some very specific types of changes in the input signal. Learning such invariant features provides greater predictive power, which has always been sought by the artificial intelligence (AI) community.


IV Representation Learning Architectures

In 2006, DL-based automatic feature discovery was initiated by Hinton and his colleagues [29] and followed up by other researchers [30, 31]. This has led to a breakthrough in representation learning research and many novel DL models have been proposed. In this section, we will discuss these models and highlight the mechanics of representation learning using them.

IV-A Deep Neural Networks (DNNs)

Historically, the idea of deep neural networks (DNNs) is an extension of ideas emerging from research on artificial neural networks (ANNs) [102]. Feed-forward neural networks (FNNs), or multilayer perceptrons (MLPs), with multiple hidden layers are indeed a good example of deep architectures. DNNs consist of multiple layers of processing units called "neurons": an input layer, hidden layers, and an output layer. The neurons in each layer are densely connected with the neurons of the adjacent layers. The goal of a DNN is to approximate some function. For instance, a DNN classifier maps an input to a category label by using a mapping function and learns the values of the parameters that result in the best function approximation. Each layer of a DNN performs representation learning on the input provided to it. For example, in the case of a classifier, all hidden layers except the last (softmax) layer learn a representation of the input data that makes the classification task easier. A well-trained DNN learns a hierarchy of distributed representations [74]. Increasing the depth of a DNN promotes the reuse of learnt representations and enables the learning of a deep hierarchy of representations at different levels of abstraction. Higher levels of abstraction are generally associated with invariance to local changes of the input [1]. These representations have proved very helpful in designing different speech-based systems.
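The layer-wise view of a DNN classifier can be made concrete with a tiny numpy forward pass, in which every hidden layer yields an intermediate representation and a final softmax layer classifies; all sizes and the initialisation are arbitrary illustrative choices:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
sizes = [20, 16, 8, 3]                  # input -> two hidden layers -> 3 classes
params = [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
          for m, n in zip(sizes, sizes[1:])]

def forward(x):
    """Each hidden layer computes a new representation of its input;
    the final softmax layer turns the last one into class probabilities."""
    reps, h = [], x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
        reps.append(h)                  # intermediate representation
    W, b = params[-1]
    return softmax(h @ W + b), reps

probs, reps = forward(rng.normal(size=(5, 20)))   # a batch of 5 inputs
```

Training (e. g., by backpropagation) would adjust `params` so that the successive representations in `reps` make the final linear classification progressively easier.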

IV-B Convolutional Neural Networks (CNNs)

Convolutional neural networks (CNNs) [104] are a specialised kind of deep architecture for processing data with a grid-like topology. Examples include image data, which form a 2D grid of pixels, and time-series data (i. e., a 1D grid) with samples at regular intervals of time. CNNs are a variant of the standard FNNs. They introduce convolutional and pooling layers into the structure of DNNs, which take into account the spatial structure of the data and make the network more efficient through sparse interactions, parameter sharing, and equivariant representations. The convolution operation in the convolution layer is the fundamental building block of CNNs. It consists of several learnable kernels that are convolved with the input to compute the output feature maps. This operation is defined as:

y_j = Σ_i (x_i * k_ij) + b_j,

where y_j represents the j-th output feature map, x_i is the i-th input feature map, and k_ij and b_j represent the learnable filter and bias, respectively. The symbol * denotes the 2D convolution operation. After the convolution operation, a pooling operation is applied, which facilitates non-linear downsampling of the feature maps and makes the representations invariant to small translations in the input [73]. Finally, it is common to use DNN layers to accumulate the outputs from the previous layers and yield a stochastic likelihood representation for classification or regression.

In contrast to DNNs, the training of CNNs is easier due to their fewer parameters [105]. CNNs have been found very powerful at extracting low-level representations in the initial layers and high-level features (textures and semantics) in the higher layers [106]. The convolution layer in CNNs acts as a data-driven filter bank that is able to capture representations from speech [107] that are more generalised [108], discriminative [106], and contextual [109].
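A direct (and deliberately naive) numpy translation of the feature-map computation y_j = Σ_i (x_i * k_ij) + b_j might look as follows; like most DL libraries, it implements cross-correlation rather than a flipped-kernel convolution:

```python
import numpy as np

def conv2d(x, kernels, biases):
    """y_j = sum_i (x_i * k_ij) + b_j: each output map j sums 'valid'
    2-D cross-correlations of every input map i with its learnable kernel."""
    C, H, W = x.shape                     # C input feature maps of size H x W
    J, _, kh, kw = kernels.shape          # J output maps, kernels of size kh x kw
    out = np.zeros((J, H - kh + 1, W - kw + 1))
    for j in range(J):
        for i in range(C):
            for r in range(out.shape[1]):
                for c in range(out.shape[2]):
                    out[j, r, c] += np.sum(x[i, r:r + kh, c:c + kw] * kernels[j, i])
        out[j] += biases[j]               # one shared bias per output map
    return out

x = np.ones((1, 4, 4))                    # one 4x4 input map
k = np.ones((1, 1, 2, 2))                 # one 2x2 kernel (all ones)
y = conv2d(x, k, biases=np.zeros(1))      # each output cell sums a 2x2 patch
```

The quadruple loop is purely didactic; real implementations vectorise or lower the operation to matrix multiplication, and a pooling step (e. g., a 2x2 max) would typically follow.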

IV-C Recurrent Neural Networks (RNNs)

Recurrent neural networks (RNNs) [110] extend FNNs by introducing recurrent connections within layers. They use the previous state of the model as additional input at each time step, which creates a memory in the hidden state containing information from all previous inputs. This gives RNNs a stronger representational memory than hidden Markov models (HMMs), whose discrete hidden states bound their memory [111]. Given an input sequence x = (x_1, ..., x_T), at each time step t an RNN calculates the hidden state h_t using the previous hidden state h_{t-1} and outputs a vector sequence y = (y_1, ..., y_T). The standard equations for RNNs are given below:

h_t = H(W_xh x_t + W_hh h_{t-1} + b_h),
y_t = W_hy h_t + b_y,

where the W terms are the weight matrices (i. e., W_xh is the weight matrix of the input-hidden layer), the b terms are the bias vectors, and H denotes the hidden layer function. Simple RNNs usually fail to model long-term temporal dependencies due to the vanishing gradient problem. To deal with this problem, multiple specialised RNN architectures exist, including long short-term memory (LSTM) [112] and gated recurrent units (GRUs) [111], which use gating mechanisms to add and forget information selectively. Bidirectional RNNs [113] were proposed to model future context by passing the input sequence through two separate recurrent hidden layers. These separate recurrent layers are connected to the same output layer to access the temporal context in both directions, modelling both past and future.

RNNs introduce recurrent connections that allow parameters to be shared across time, which makes them very powerful in learning temporal dynamics from sequential data (e. g., audio, video). Due to these abilities, RNNs, especially LSTMs, have had an enormous impact in the speech community, and they are incorporated in state-of-the-art ASR systems [114].
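The RNN recurrence translates directly into a short numpy sketch; tanh is assumed as the hidden layer function H, and all names and sizes are illustrative:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h);  y_t = W_hy h_t + b_y."""
    h = np.zeros(W_hh.shape[0])                 # h_0: zero initial state
    hs, ys = [], []
    for x_t in x_seq:                           # one step per input frame
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
        ys.append(W_hy @ h + b_y)
    return np.array(ys), np.array(hs)

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 8, 3, 5
ys, hs = rnn_forward(rng.normal(size=(T, d_in)),
                     0.1 * rng.normal(size=(d_h, d_in)),   # W_xh
                     0.1 * rng.normal(size=(d_h, d_h)),    # W_hh
                     0.1 * rng.normal(size=(d_out, d_h)),  # W_hy
                     np.zeros(d_h), np.zeros(d_out))
```

Note that the same weight matrices are reused at every time step, which is exactly the parameter sharing across time described in the text; an LSTM or GRU would replace the single tanh update with gated updates of the state.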

IV-D Autoencoders (AEs)

The idea of an autoencoding network [115] is to learn a mapping from high-dimensional observations to a lower-dimensional feature space such that the input observations can be approximately reconstructed from the lower-dimensional representation. The encoder f_θ maps the input vector x into the feature/representation vector h = f_θ(x). The decoder network is responsible for mapping the feature vector h back to reconstruct the input vector, and parameterises the decoder function x̂ = g_θ(h). Overall, the parameters θ are optimised by minimising the following cost function:

J(θ) = Σ_i ||x_i − g_θ(f_θ(x_i))||²     (4)

The parameters of the encoder and decoder networks are learnt simultaneously by attempting to incur a minimal reconstruction error. If the input data have correlated structures, then autoencoders (AEs) can learn some of these correlations [78]. To capture useful representations h, the cost function of Equation 4 is usually optimised with an additional constraint to prevent the AE from learning the useless identity function, which has zero reconstruction error. This is achieved in various ways in the different forms of AEs, as discussed below in more detail.
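A minimal NumPy sketch of the encoder f_θ, the decoder g_θ, and the cost of Equation 4; the tied-weight sigmoid layers here are an illustrative choice, not prescribed by the survey:

```python
import numpy as np

def encode(x, W, b):
    """Encoder f(x) = sigmoid(W x + b)."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def decode(h, W, c):
    """Decoder g(h) with tied weights: sigmoid(W^T h + c)."""
    return 1.0 / (1.0 + np.exp(-(W.T @ h + c)))

def reconstruction_cost(X, W, b, c):
    """Squared-error cost of Equation 4, summed over the dataset."""
    return sum(np.sum((x - decode(encode(x, W, b), W, c)) ** 2) for x in X)

rng = np.random.default_rng(1)
n_in, n_hid = 20, 5                      # undercomplete: 5 < 20
W = rng.standard_normal((n_hid, n_in)) * 0.1
X = rng.random((100, n_in))
cost = reconstruction_cost(X, W, np.zeros(n_hid), np.zeros(n_in))
print(cost > 0)  # True: an untrained AE has non-zero reconstruction error
```

Training would minimise this cost by gradient descent on W, b, and c; the variants below differ only in the extra constraints they add to it.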

IV-D1 Undercomplete Autoencoders (AEs)

One way of learning useful feature representations h is to regularise the autoencoder by constraining it to have a low-dimensional feature size. In this way, the AE is forced to learn the salient features/representations of the data by mapping it from a high-dimensional space to a low-dimensional feature space. If an autoencoder uses a linear activation function with the mean squared error criterion, then the resultant architecture becomes equivalent to the PCA algorithm, and its hidden units learn the principal components of the input data [1]. However, an autoencoder with non-linear activation functions can learn a more useful feature representation than PCA [29].

IV-D2 Sparse Autoencoders (AEs)

An AE network can also discover a useful feature representation of the data even when the size of the feature representation is larger than the input vector [78]. This is done using the idea of sparsity regularisation [31], which imposes an additional constraint on the hidden units. Sparsity can be achieved either by penalising the hidden unit biases [31] or by penalising the outputs of the hidden units; the former, however, hurts numerical optimisation. Therefore, imposing sparsity directly on the outputs of the hidden units is very popular and has several variants. One way to realise a sparse AE is to incorporate an additional term in the loss function that penalises the KL divergence between the average activation of each hidden unit and a desired sparsity level ρ [116]. Let a_j(x) denote the activation of hidden unit j for a given input x; then, the average activation of this unit over the training set is given by:

ρ̂_j = (1/m) Σ_{i=1}^{m} a_j(x_i)

where m is the number of training samples. The cost function of a sparse autoencoder then becomes:

J_sparse(θ) = J(θ) + β Σ_j KL(ρ || ρ̂_j)

where J(θ) is the cost function of the standard autoencoder and β weights the sparsity penalty. Another way to penalise the hidden units is to use the L1 norm of their activations as penalty, by which the objective becomes:

J_sparse(θ) = J(θ) + λ Σ_j |h_j|
Sparseness plays a key role in learning a more meaningful representation of the input data [116]. It has been found that sparse AEs are simple to train and can learn better representations than denoising autoencoders (DAEs) and RBMs [117]. In particular, sparse AEs can learn useful information and attributes from speech, which can facilitate better classification performance [118].
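The KL sparsity penalty can be computed directly from a matrix of hidden activations; the target sparsity ρ = 0.05 below is a typical but arbitrary choice:

```python
import numpy as np

def kl_sparsity_penalty(H, rho=0.05):
    """KL divergence between the target sparsity rho and the average
    activation rho_hat_j of each hidden unit over the training set.
    H: (m, n_hidden) matrix of sigmoid activations in (0, 1)."""
    rho_hat = H.mean(axis=0)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return kl.sum()

rng = np.random.default_rng(2)
H_dense = rng.uniform(0.4, 0.6, size=(1000, 50))    # units active ~half the time
H_sparse = rng.uniform(0.01, 0.09, size=(1000, 50)) # units mostly inactive
# the penalty is far larger for the non-sparse code
print(kl_sparsity_penalty(H_dense) > kl_sparsity_penalty(H_sparse))  # True
```

Adding β times this penalty to the reconstruction cost pushes every hidden unit to be active only about ρ of the time.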

IV-D3 Denoising Autoencoders (DAEs)

Denoising autoencoders (DAEs) are considered a stochastic version of the basic AE. They are trained to reconstruct a clean input from its corrupted version [119]. The objective function of a DAE is given by:

J_DAE(θ) = Σ_i ||x_i − g_θ(f_θ(x̃_i))||²

where x̃ is the corrupted version of x, obtained via a stochastic mapping x̃ ∼ q(x̃|x). During training, DAEs still minimise the same reconstruction loss between a clean x and its reconstruction. The difference is that the representation h = f_θ(x̃) is learnt by applying a deterministic mapping to a corrupted input. A DAE thus learns higher-level feature representations that are robust to the corruption process. The features learnt by a DAE are reported to be qualitatively better for tasks like classification, and also better than RBM features [1].
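The corruption q(x̃|x) is commonly implemented as additive noise plus random masking; the sketch below (the noise level and masking fraction are illustrative) produces the corrupted input that the DAE is then trained to map back to the clean x:

```python
import numpy as np

rng = np.random.default_rng(3)

def corrupt(x, noise_std=0.1, mask_frac=0.3):
    """Stochastic corruption q(x_tilde | x): additive Gaussian noise
    plus randomly zeroed ('masked') dimensions."""
    x_tilde = x + rng.normal(0.0, noise_std, size=x.shape)
    mask = rng.random(x.shape) > mask_frac
    return x_tilde * mask

x = rng.random(40)          # a clean frame, e.g., a log-Mel feature vector
x_tilde = corrupt(x)
# The DAE is trained to map x_tilde back to the *clean* x:
#   loss_i = || x_i - g(f(x_tilde_i)) ||^2
print(x_tilde.shape == x.shape)  # True: corruption preserves the dimensionality
```

Because the target is always the clean frame, the encoder cannot simply copy its input and must learn structure that survives the corruption.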

IV-D4 Contractive Autoencoders (CAEs)

Contractive autoencoders (CAEs), proposed by Rifai et al. [120] with the motivation of learning robust representations, are similar to DAEs. CAEs are forced to learn representations that are robust to infinitesimal input variations. This is achieved by adding an analytic contractive penalty to Equation 4. The penalty term is the squared Frobenius norm of the Jacobian matrix of the hidden layer with respect to the input x. The loss function for a CAE is given by:

J_CAE(θ) = J(θ) + λ Σ_{j=1}^{d_h} ||∇_x h_j(x)||²

where d_h is the number of hidden units and h_j(x) is the activation of hidden unit j.
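For a sigmoid encoder the contractive penalty has a closed form, since ∂h_j/∂x_i = h_j(1 − h_j) w_ji; the sizes below are illustrative:

```python
import numpy as np

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the Jacobian dh/dx for a sigmoid
    encoder h = sigmoid(W x + b). Closed form:
    sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2."""
    h = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(4)
W = rng.standard_normal((10, 30)) * 0.1
x = rng.random(30)
penalty = contractive_penalty(x, W, np.zeros(10))
# The penalty shrinks as the encoder saturates (a large bias pushes h
# towards 1, so h(1-h) -> 0 and the mapping becomes locally flat):
print(penalty > contractive_penalty(x, W, 10.0 * np.ones(10)))  # True
```

Minimising this penalty makes the representation locally insensitive to the input, while the reconstruction term keeps it sensitive along the data manifold.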

IV-E Deep Generative Models

Generative models are powerful at learning the distribution of any kind of data (audio, images, or video) and aim to generate new data points from it. Here, we discuss four generative models chosen for their popularity in the speech community.

IV-E1 Boltzmann Machines and Deep Belief Networks

Deep Belief Networks (DBNs) [121] are powerful probabilistic generative models that consist of multiple layers of stochastic latent variables, where each layer is a Restricted Boltzmann Machine (RBM) [122]. An RBM is a bipartite graph in which visible units are connected to hidden units through undirected weighted connections; it is restricted in the sense that there are no hidden-hidden or visible-visible connections. An RBM is an energy-based model whose joint probability distribution between the visible layer (v) and the hidden layer (h) is given by:

P(v, h) = (1/Z) exp(−E(v, h))

where Z is the normalising constant, also known as the partition function, and E is an energy function defined by the following equation:

E(v, h) = −Σ_i a_i v_i − Σ_j b_j h_j − Σ_{i,j} v_i w_{ij} h_j

where v_i and h_j are the binary states of the visible and hidden units, w_{ij} are the weights between visible and hidden nodes, and a_i and b_j represent the bias terms for the visible and hidden units, respectively.
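The energy and the unnormalised joint probability follow directly from these equations; note that the true P(v, h) additionally requires the intractable partition function Z, which sums over all 2^(n_v + n_h) joint states:

```python
import numpy as np

def rbm_energy(v, h, W, a, b):
    """E(v, h) = -a^T v - b^T h - v^T W h for binary units."""
    return -(a @ v) - (b @ h) - v @ W @ h

def unnormalised_prob(v, h, W, a, b):
    """exp(-E(v, h)); dividing by the partition function Z would
    give the joint probability P(v, h)."""
    return np.exp(-rbm_energy(v, h, W, a, b))

rng = np.random.default_rng(5)
n_v, n_h = 6, 4
W = rng.standard_normal((n_v, n_h)) * 0.1
a, b = np.zeros(n_v), np.zeros(n_h)
v = rng.integers(0, 2, n_v)       # a binary visible configuration
h = rng.integers(0, 2, n_h)       # a binary hidden configuration
print(unnormalised_prob(v, h, W, a, b) > 0)  # True: always positive
```

The intractability of Z is precisely why the MCMC-based training discussed next (e.g., contrastive divergence) approximates the log-likelihood gradient instead of computing it exactly.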

During the training phase, an RBM uses Markov Chain Monte Carlo (MCMC)-based algorithms [121] to maximise the log-likelihood of the training data. MCMC-based training computes the gradient of the log-likelihood, which poses a significant learning problem [123]. Moreover, DBNs are trained in a layer-wise fashion, which is also computationally inefficient. In recent years, generative models like GANs and VAEs have been proposed that can be trained via direct back-propagation and avoid the difficulties of MCMC-based training. We discuss GANs and VAEs in more detail next.

Model | Characteristics

- DNNs: good for learning a hierarchy of representations; they can learn invariant and discriminative representations, and features learnt by DNNs are more generalised compared to traditional methods.
- CNNs: originated from image recognition and were also extended to NLP and speech; they can learn a high-level abstraction from speech.
- RNNs/LSTMs: good for sequential modelling; they can learn temporal structures from speech and have outperformed DNNs.
- AEs: powerful unsupervised representation learning models that encode the data in sparse and compressed representations.
- VAEs: stochastic variational inference and learning models; popular for learning disentangled representations from speech.
- GANs: a game-theoretical framework, very powerful for data generation and robust to overfitting; they can learn disentangled representations that are very suitable for speech analysis.

TABLE III: Summary of some popular representation learning models.

IV-E2 Generative Adversarial Networks (GANs)

Generative adversarial networks (GANs) [11] use adversarial training to directly shape the output distribution of the network via back-propagation. They include two neural networks, a generator G and a discriminator D, which play a min-max adversarial game defined by the following optimisation program:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

The generator G maps latent vectors z, drawn from some known prior p_z (e.g., Gaussian), to fake data points G(z). The discriminator D is tasked with differentiating between generated (fake) samples G(z) and real data samples x drawn from the data distribution p_data. The generator network G is trained to maximally confuse the discriminator into believing that the samples it generates come from the data distribution. This makes GANs very powerful: they have become very popular and are being exploited in various ways by the speech community, either for speech synthesis or to augment the training material with generated feature observations or generated speech itself. Researchers have proposed various other architectures based on the idea of GANs, including conditional GANs [126], BiGAN [127], and InfoGAN [128]. These days, GAN-based architectures are widely used for representation learning not only from images but also from speech and related fields.
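A Monte-Carlo estimate of the value V(D, G) can be sketched as follows; the toy discriminator and the Gaussian "real"/"fake" batches are purely illustrative:

```python
import numpy as np

def gan_value(D, real_batch, fake_batch):
    """Monte-Carlo estimate of
    V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    eps = 1e-12  # numerical safety for log
    return (np.mean(np.log(D(real_batch) + eps))
            + np.mean(np.log(1.0 - D(fake_batch) + eps)))

def D(batch):
    """Toy discriminator: logistic score on the mean of each sample."""
    return 1.0 / (1.0 + np.exp(-batch.mean(axis=1)))

rng = np.random.default_rng(6)
real = rng.normal(2.0, 1.0, size=(256, 8))    # "data" distribution
fake = rng.normal(-2.0, 1.0, size=(256, 8))   # generator output, easily told apart
# When D separates real from fake well, V(D, G) is high (close to 0);
# swapping the batches simulates a D that gets everything wrong:
print(gan_value(D, real, fake) > gan_value(D, fake, real))  # True
```

In training, D ascends this value while G descends it by pushing its samples towards regions where D(G(z)) is high.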

IV-E3 Variational Autoencoders

Variational autoencoders (VAEs) are probabilistic models that use a stochastic encoder (inference network) to model the posterior distribution q_φ(z|x), and a generative network (decoder) to model the conditional log-likelihood log p_θ(x|z). Both networks are jointly trained to maximise the following variational lower bound on the data log-likelihood:

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z))

The first term is the standard reconstruction term of an AE and the second term is the KL divergence between the prior p(z) and the posterior distribution q_φ(z|x). The second term acts as a regularisation term; without it, the model is simply a standard autoencoder. VAEs are becoming very popular for learning representations from speech. Recently, various variants of VAEs have been proposed in the literature, including β-VAE [129], InfoVAE [130], PixelVAE [131], and many more [132]. These VAEs are very powerful in learning disentangled and hierarchical representations and are also popular for clustering the multi-category structures of data [132].
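With a diagonal-Gaussian posterior q_φ(z|x) = N(μ, diag(exp(logvar))) and a standard normal prior, the KL term has a closed form; the Gaussian reconstruction term below is one common choice, used here only to sketch the bound:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, exp(logvar)) || N(0, I) ), the
    regularisation term of the VAE lower bound."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def elbo(x, x_recon, mu, logvar):
    """Variational lower bound with a Gaussian reconstruction term
    (up to additive constants): -||x - x_recon||^2 / 2 - KL."""
    return -0.5 * np.sum((x - x_recon) ** 2) - gaussian_kl(mu, logvar)

rng = np.random.default_rng(7)
x = rng.random(16)
# A posterior that matches the prior incurs zero KL penalty:
print(np.isclose(gaussian_kl(np.zeros(8), np.zeros(8)), 0.0))  # True
# A perfect reconstruction with a prior-matching posterior maximises the bound:
print(elbo(x, x, np.zeros(8), np.zeros(8)) == 0.0)  # True
```

Training trades these two terms off: compressing z towards the prior while keeping enough information to reconstruct x, which is what encourages disentangled codes.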

IV-E4 Autoregressive Networks (ANs)

Autoregressive networks (ANs) are directed probabilistic models with no latent random variables. They model the joint distribution of high-dimensional data x as a product of conditional distributions using the following probabilistic chain rule:

p(x; θ) = Π_{t=1}^{T} p(x_t | x_1, …, x_{t−1}; θ)

where x_t is the t-th variable of x and θ are the parameters of the AN model. The conditional probability distributions in ANs are usually modelled with a neural network that receives x_1, …, x_{t−1} as input and outputs a distribution over the possible values of x_t. Popular ANs include PixelRNN [43], PixelCNN [133], and WaveNet [44]. ANs are powerful density estimators that capture fine details of the data, but unlike latent variable models such as GANs and VAEs, they do not learn a hierarchical latent representation. In speech technology, WaveNet is very popular and has powerful acoustic modelling capabilities. It is used for speech synthesis [44] and denoising [134], and also in an unsupervised representation learning setting in conjunction with VAEs [135].
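The chain-rule factorisation can be evaluated for any conditional model; the "repeat the previous symbol" toy model below stands in for the neural network and is purely illustrative:

```python
import numpy as np

def sequence_log_prob(x, cond_prob):
    """log p(x) = sum_t log p(x_t | x_{<t}) for a binary sequence,
    where cond_prob(prefix) returns P(x_t = 1 | prefix)."""
    logp = 0.0
    for t in range(len(x)):
        p1 = cond_prob(x[:t])
        logp += np.log(p1 if x[t] == 1 else 1.0 - p1)
    return logp

def cond_prob(prefix):
    """Toy conditional model: repeat the previous symbol w.p. 0.9."""
    if len(prefix) == 0:
        return 0.5
    return 0.9 if prefix[-1] == 1 else 0.1

# A self-consistent run is far more likely than an alternating sequence:
print(sequence_log_prob([1, 1, 1, 1], cond_prob) >
      sequence_log_prob([1, 0, 1, 0], cond_prob))  # True
```

In WaveNet, cond_prob is a dilated convolutional network over the raw waveform prefix, but the factorised likelihood it optimises has exactly this form.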

In this section, we have discussed DL models that are used for representation learning of speech. In Table III, we highlight the key characteristics of these DL models in terms of their representation learning abilities. All these models can be trained in different ways to learn useful representations from speech, which we review in the next section.

V Techniques for Representation Learning From Speech

Deep models can be used in different ways to automatically discover suitable representations for the task at hand. This section covers these techniques for learning features from speech for ASR, SR, and SER. Figure 3 shows the different learning techniques that can be used to capture information from data, and Table IV highlights their important attributes.

Fig. 3: Representation Learning Techniques.

- Supervised Learning: learns explicitly from data with labels; direct feedback is given; predicts an outcome/future; no exploration.
- Unsupervised Learning: learns patterns and structure from data without labels; no direct feedback; no prediction; no exploration.
- Semi-Supervised Learning: blends supervised and unsupervised learning using data with and without labels; direct feedback is given; predicts an outcome/future; no exploration.
- Transfer Learning: transfers knowledge from one supervised task to another, using labelled data for a different task; direct feedback is given; predicts an outcome/future; no exploration.
- Reinforcement Learning: reward-based learning and policy making; predicts an outcome/future; adaptable to changes through exploration.

TABLE IV: Comparison of different types of representation learning.

V-A Supervised Learning

Deep learning (DL) models can learn representations from data in both unsupervised and supervised manners. In the supervised case, feature representations are learnt on datasets by considering the label information. In the speech domain, supervised representation learning methods are widely employed for feature learning. RBMs and DBNs have been found very capable of learning features from speech for different tasks, including ASR [52, 136], speaker recognition [137, 138], and SER [139]. In particular, DBNs trained in a greedy layer-wise fashion [121] can learn useful features from the speech signal [140].

Convolutional neural networks (CNNs) [104] are another popular supervised model that is widely used for feature extraction from speech. They have shown very promising results for speech and speaker recognition tasks by learning more generalised features from raw speech compared to ANNs and other feature-based approaches [107, 108, 141]. After the success of CNNs in ASR, researchers also explored them for SER [41, 142, 143, 106], using CNNs in combination with LSTM networks to model long-term dependencies in emotional speech. Overall, it has been found that LSTMs (or GRUs) can help CNNs learn more useful features from speech [144, 40].

Despite the promising results, the success of supervised learning is limited by the requirement of transcriptions or labels for speech-related tasks; it cannot exploit the plethora of freely available unlabelled datasets. It is also important to note that labelling these datasets is very expensive in terms of time and resources. To tackle these issues, unsupervised learning comes into play to learn representations from unlabelled data. We discuss the potential of unsupervised learning in the next section.

V-B Unsupervised Learning

Unsupervised learning facilitates the analysis of input data without corresponding labels and aims to learn the underlying inherent structure or distribution of the data. Real-world data (speech, images, text) have extremely rich structures, and algorithms trained in an unsupervised way aim to build an understanding of the data rather than to learn a particular task. Unsupervised representation learning from large unlabelled datasets is an active area of research. In the context of speech analysis, it can exploit the practically unlimited amount of available unlabelled corpora to learn good intermediate feature representations, which can then be used to improve the performance of a variety of supervised tasks such as speech emotion recognition [145].

For unsupervised representation learning, researchers have mostly utilised variants of autoencoders (AEs) to learn suitable features from speech data. AEs can learn high-level semantic content (e.g., phoneme identities) that is invariant to confounding low-level details (pitch contour or background noise) in speech [135].

In ASR and SR, most studies have utilised VAEs for unsupervised representation learning from speech [135, 146]. VAEs can jointly learn a generative model and an inference model, which allows them to capture latent variables from observed data. In [46], the authors used an FHVAE to capture interpretable and disentangled representations from speech without any supervision. They evaluated the model on two speech corpora and demonstrated that the FHVAE can satisfactorily extract linguistic content from speech, outperform an i-vector baseline on a speaker verification task, and reduce the WER for ASR. Other autoencoding architectures like DAEs have also been found very promising for learning speech representations in an unsupervised way. Most importantly, they can produce robust representations for noisy speech recognition [147, 148, 149].

Similarly, classical models like RBMs have proved very successful for learning representations from speech. For instance, Jaitly and Hinton used RBMs for phoneme recognition in [52] and showed that RBMs can learn more discriminative features that achieve better performance than MFCCs. Interestingly, RBMs can also learn filterbanks from raw speech. In [150], Sailor and Patil used a convolutional RBM (ConvoRBM) to learn auditory-like sub-band filters from the raw speech signal. The authors showed that the unsupervised deep auditory features learnt by the ConvoRBM can outperform Mel filterbank features for ASR. Similarly, DBNs trained on features such as MFCCs [151] or Mel-scale filterbanks [140] create high-level feature representations.

In SER, similar to ASR and SR, models including AEs, DAEs, and VAEs are mostly used for unsupervised representation learning. In [152], Ghosh et al. used stacked AEs to learn emotional representations from speech. They found that stacked AEs can learn highly discriminative features from speech that are suitable for the emotion classification task. Other studies [153, 154] also used AEs to capture emotional representations from speech and found them very powerful in learning discriminative features. DAEs were exploited in [155, 156], showing their suitability for SER. In [79], the authors used VAEs to learn latent representations of speech emotions. They showed that VAEs can learn better emotional representations suitable for classification in contrast to standard AEs.

As outlined above, adversarial learning (AL) has recently become very popular for learning unsupervised representations from speech. It involves more than one network and enables learning in an adversarial way, which yields more discriminative [157] and robust [158] features. Especially GANs [159], adversarial autoencoders (AAEs) [160], and other AL-based models [161] are becoming popular for modelling speech not only in ASR but also in SR and SER.

Despite all these successes, the performance of representations learnt in an unsupervised way can generally hardly compete with that of supervised methods. Semi-supervised representation learning techniques can address this issue by simultaneously utilising both labelled and unlabelled data. We discuss semi-supervised representation learning methods in the next section.

V-C Semi-supervised Learning

The success of DL has predominantly been enabled by key factors like advanced algorithms, processing hardware, the open sharing of code and papers, and, most importantly, the availability of large-scale labelled datasets and of networks pre-trained on them, e.g., on ImageNet. However, a large labelled database or pre-trained network is not always available for every problem, such as speech emotion recognition [162, 163, 164]. It is very difficult, expensive, and time-consuming to annotate such data, as it requires expert human effort [165]. Semi-supervised learning solves this problem by utilising large amounts of unlabelled data together with the labelled data to build better classifiers. It reduces human effort and provides higher accuracy; therefore, semi-supervised models are of great interest both in theory and practice [166].

Semi-supervised learning is very popular in SER, and researchers have tried various models to learn emotional representations from speech. Huang et al. [167] used a CNN in a semi-supervised manner to capture affect-salient representations and reported superior performance compared to well-known hand-engineered features. Ladder-network-based semi-supervised methods are very popular in SER and were used in [168, 169, 170]. A ladder network is an unsupervised DAE that is trained along with a supervised classification or regression task; it can learn more generalised representations suitable for SER compared to standard methods. Deng et al. [171] proposed a semi-supervised model by combining an AE and a classifier, treating samples from the unlabelled data as an extra garbage class in the classification problem. Features learnt by the semi-supervised AE performed better than those of an unsupervised AE. In [165], the authors trained an AAE utilising additional unlabelled emotional data to improve SER performance. They showed that the additional data help to learn more generalised representations that perform better compared to supervised and unsupervised methods.

In ASR, semi-supervised learning is mainly used to circumvent the lack of sufficient training data by creating feature front ends [172], by using multilingual acoustic representations [173], and by extracting intermediate representations from large unpaired datasets [174] to improve system performance. In SR, DNNs were used to learn representations for both the target speaker and the interference for speech separation in a semi-supervised way [175]. Recently, a GAN-based model was exploited for a speaker diarisation system, with superior results using semi-supervised training [176].

V-D Transfer Learning

Transfer learning (TL) involves methods that utilise any knowledge resource (i.e., data, models, labels, etc.) to improve model learning and generalisation for the target task [177]. The idea behind TL is "learning to learn", which holds that learning from scratch (tabula rasa learning) is often limited and that experience should be exploited for deeper understanding [178]. TL encompasses different approaches, including multi-task learning (MTL), model adaptation, knowledge transfer, and covariate shift. In the speech processing field, representation learning has gained much interest within these TL approaches. In this section, we cover three popular TL techniques used in today's speech technology: domain adaptation, multi-task learning, and self-taught learning.

V-D1 Domain Adaptation

Deep domain adaptation is a sub-field of TL that has emerged to address the unavailability of labelled data, aiming to eliminate the training-testing mismatch. Speech is a typical example of heterogeneous data: a mismatch always exists between the probability distributions of the source and target domain data, which can degrade system performance [179]. To build more robust systems for real-life speech applications, domain adaptation techniques are usually applied in the training pipeline of deep models to learn representations that explicitly minimise the difference between the source and target domains.
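As one concrete, illustrative instance of reducing source-target mismatch in feature space, a correlation-alignment-style transformation (in the spirit of CORAL, not a method proposed in the surveyed work) can be sketched as follows:

```python
import numpy as np

def coral_align(source, target):
    """Whiten the source features and re-colour them with the target
    statistics, so that a source-trained model sees target-like inputs."""
    def stats(X):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        return mu, cov

    def sqrtm(C):
        """Matrix square root via eigendecomposition (C is SPD)."""
        w, V = np.linalg.eigh(C)
        return V @ np.diag(np.sqrt(w)) @ V.T

    mu_s, cov_s = stats(source)
    mu_t, cov_t = stats(target)
    whiten = np.linalg.inv(sqrtm(cov_s))
    return (source - mu_s) @ whiten @ sqrtm(cov_t) + mu_t

rng = np.random.default_rng(9)
src = rng.normal(0.0, 1.0, size=(500, 3))   # e.g., clean training features
tgt = rng.normal(2.0, 0.5, size=(500, 3))   # e.g., shifted noisy-domain features
aligned = coral_align(src, tgt)
# After alignment, the source batch matches the target mean exactly:
print(np.allclose(aligned.mean(axis=0), tgt.mean(axis=0)))  # True
```

Deep domain adaptation methods pursue the same goal, matching source and target feature distributions, but learn the alignment inside the network rather than applying a fixed linear transform.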

Researchers have attempted different methods of domain adaptation using representation learning to achieve robustness under noisy conditions in ASR systems. In [179], the authors used a DNN-based unsupervised representation method to eliminate the difference between the training and the testing data. They evaluated the model with clean training data and noisy test data and found that a relative error reduction was achieved due to the elimination of the mismatch between the training and test distributions. Another study [180] explored unsupervised domain adaptation of the acoustic model by learning hidden unit contributions. The authors evaluated the proposed adaptation method on four different datasets and achieved improved results compared to unadapted methods. Hsu et al. [46] used a VAE-based domain adaptation method to learn a latent representation of speech and to create additional labelled training data (source domain) with a distribution similar to that of the testing data (target domain). The authors were able to reduce the absolute word error rate (WER) by 35% in contrast to a non-adapted baseline. Similarly, in [181], domain-invariant representations were extracted using a Factorised Hierarchical Variational Autoencoder (FHVAE) for robust ASR. Some studies [182, 183] also explored unsupervised representation-learning-based domain adaptation for distant conversational speech recognition and found that this approach outperformed unadapted models and other baselines. For unsupervised speaker adaptation, Fan et al. [184] used multi-speaker DNNs to take advantage of a shared hidden representation and achieved improved results.

Many researchers have exploited DNN models for learning transferable representations in multi-lingual ASR [185, 186]. Cross-lingual transfer learning is important for practical applications, and it has been found that learnt features can be transferred to improve the performance of both resource-limited and resource-rich languages [172]. The representations learnt in this way are referred to as bottleneck features, and they can be used to train models for languages even without any transcriptions [187]. Recently, adversarial learning of representations for domain adaptation has become very popular. Researchers have trained different adversarial models to improve robustness against noise [188], the adaptation of acoustic models to accented speech [189, 190], gender variability [191], and speaker and environment variability [192, 193, 194, 195]. These studies showed that representation learning using adversarially trained models can improve ASR performance on unseen domains.

In SR, Shon et al. used a DAE to minimise the mismatch between the training and testing domains by utilising out-of-domain information [196]. Interestingly, domain adversarial training was utilised by Wang et al. [197] to learn speaker-discriminative representations. The authors empirically showed that adversarial training helps to solve the dataset mismatch problem and outperforms other unsupervised domain adaptation methods. Similarly, a GAN was recently utilised by Bhattacharya et al. [198] to learn speaker embeddings for a domain-robust end-to-end speaker verification system, achieving significantly better results than the baseline.

In SER, domain adaptation methods are also very popular for learning representations that can be used to perform emotion identification across different corpora and different languages. Deng et al. [199] used an AE with shared hidden layers to learn common representations for different emotional datasets; they were able to minimise the mismatch between the datasets and to increase performance. In another study [200], the authors used a Universum AE for cross-corpus SER and were able to learn more generalised representations that achieve promising results compared to standard AEs. Some studies have exploited adversarial training for SER. For instance, Wang et al. [197] used adversarial training to capture common representations for both the source and target language data. Zhou et al. [201] used a class-wise domain adaptation method based on adversarial training to address cross-corpus mismatch and showed that adversarial training is useful when the model is to be trained on a target language with minimal labels. Gideon et al. [202] used an adversarial discriminative domain generalisation method for cross-corpus emotion recognition and achieved better results. Similarly, [163] utilised GANs in an unsupervised way to learn language-invariant representations, evaluated over four different language datasets, and were able to significantly improve SER across languages using these language-invariant features.

V-D2 Multi-Task Learning

Multi-task learning (MTL) has led to successes in different applications of ML, from NLP [203] and speech recognition [204] to computer vision [205]. In contrast to single-task learning, MTL optimises more than one loss function and uses auxiliary tasks to improve on the main task of interest [206]. Representations learnt in an MTL scenario become more generalised, which is very important in the field of speech processing, since speech contains multi-dimensional information (message, speaker, gender, or emotion) that can serve as auxiliary tasks. As a result, MTL can increase performance without requiring external speech data.

In ASR, researchers have used MTL with different auxiliary tasks, including gender [207], speaker adaptation [208, 209], and speech enhancement [210, 211]. The results of these studies have shown that shared representations learnt for different tasks provide complementary information about the acoustic environment and give a lower word error rate (WER). Similarly to ASR, researchers have also explored MTL in SER with significantly improved results [165, 212]. For SER, studies have used emotional attributes (e.g., arousal and valence) as auxiliary tasks [213, 214, 215, 216] to improve system performance. Other auxiliary tasks that researchers have considered in SER are speaker and gender recognition [217, 165, 218], improving the accuracy of the system compared to single-task learning.
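The objective used in such setups is typically the main-task loss plus down-weighted auxiliary losses; the weights and loss values below are arbitrary illustrations:

```python
import numpy as np

def multitask_loss(main_loss, aux_losses, weights):
    """Weighted MTL objective: L = L_main + sum_k w_k * L_aux_k.
    Auxiliary tasks (e.g., gender or speaker ID) regularise the
    shared representation that the main task (e.g., ASR) uses."""
    return main_loss + sum(w * l for w, l in zip(weights, aux_losses))

# e.g., an ASR loss plus down-weighted gender and speaker auxiliary losses
loss = multitask_loss(2.3, aux_losses=[0.7, 1.1], weights=[0.3, 0.1])
print(np.isclose(loss, 2.3 + 0.3 * 0.7 + 0.1 * 1.1))  # True
```

All tasks back-propagate through the shared encoder, so the gradient of each auxiliary loss shapes the common representation; the weights w_k control how strongly.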

MTL is an effective approach to learn shared representations: it incurs no major increase in computational cost, while it improves the recognition accuracy of a system and also decreases the chance of overfitting [ranapironkov2016multi, 165]. However, MTL requires the preparation of labels for the considered auxiliary tasks. Another problem that hinders MTL is dealing with temporality differences among tasks; for instance, modelling speaker recognition requires different temporal information than phoneme recognition does [219]. Therefore, it is viable to use memory-based deep neural networks such as recurrent networks, ideally with LSTM or GRU cells, to deal with this issue.

V-D3 Self-Taught Learning

Self-taught learning [220] is a newer paradigm in ML that combines semi-supervised learning and TL. It utilises both labelled and unlabelled data; however, the unlabelled data do not need to belong to the same class labels or generative distribution as the labelled data. Such a loose restriction on the unlabelled data significantly simplifies learning from huge volumes of unlabelled data, and this fact differentiates self-taught learning from semi-supervised learning.

We found very few studies on audio-based applications of self-taught learning. In [221], the authors used self-taught learning to develop an assistive vocal interface for users with a speech impairment. The designed interface is maximally adapted to the end-users via self-taught learning and can be used for any language, dialect, grammar, and vocabulary. In another study [222], the authors proposed an AE-based sample selection method using self-taught learning. They selected highly relevant samples from unlabelled data and combined them with the training data. The proposed model was evaluated on four benchmark datasets covering computer vision, NLP, and speech recognition, with results showing that the framework can decrease negative transfer while improving knowledge transfer performance in different scenarios.

V-E Reinforcement Learning

Reinforcement learning (RL) follows the principle of behaviourist psychology, where an agent learns to take actions in an environment and tries to maximise the accumulated reward over its lifetime. In an RL problem, the agent and its environment can be modelled as being in a state s ∈ S, and the agent can perform actions a ∈ A, each of which may be a member of either a discrete or a continuous set and can be multi-dimensional. A state contains all information about the current situation that is relevant to predicting future states. The goal of RL is to find a mapping from states to actions, called the policy π, that picks actions a in given states s so as to maximise the cumulative expected reward. The policy π can be deterministic or probabilistic. RL approaches are typically based on the Markov decision process (MDP), which consists of the set of states S, the set of actions A, the rewards R, and the transition probabilities T that capture the dynamics of a system. RL has repeatedly been successful in solving various problems [223]. Most important here is deep RL, which combines deep learning with RL principles; methods such as deep Q-learning have significantly advanced the field [224].
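The MDP ingredients above can be made concrete with tabular Q-learning on a hypothetical four-state chain (entirely illustrative; deep Q-learning replaces the table with a neural network):

```python
import numpy as np

rng = np.random.default_rng(8)

# Tiny deterministic MDP: states 0..3 in a chain; action 1 moves right,
# action 0 moves left; reaching state 3 yields reward 1 and ends the episode.
def step(s, a):
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 3), s_next == 3

Q = np.zeros((4, 2))                     # state-action value table
alpha, gamma, eps = 0.5, 0.9, 0.2        # learning rate, discount, exploration
for _ in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection (exploration vs exploitation)
        a = int(rng.integers(0, 2)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (TD error)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
        s = s_next

print(np.argmax(Q[:3], axis=1))  # the greedy policy moves right: [1 1 1]
```

The learnt Q-values approach the discounted returns (1, 0.9, 0.81 for moving right from states 2, 1, 0), so the greedy policy heads straight for the rewarding state.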

A few studies have used RL-based approaches to learn representations. For instance, in [225], the authors introduced DeepMDP, a parameterised latent space model that is trained by minimising two tractable latent space losses: prediction of rewards and prediction of the distribution over the next latent states. They showed that the optimisation of these two objectives guarantees the quality of the embedding function as a representation of the state space, and that utilising DeepMDP as an auxiliary task in the Atari 2600 domain leads to large performance improvements. Zhang et al. [226] used RL to learn optimised structured representations from text. They found that an RL model can learn task-friendly representations by identifying task-relevant structures without any explicit structure annotations, which yields competitive performance.

Recently, RL has also been gaining interest in the speech community, and researchers have proposed multiple approaches to model different speech problems. Popular RL-based solutions include dialogue modelling and optimisation [227, 228], speech recognition [229], and emotion recognition [230]. However, representation learning of speech signals using RL remains unexplored.

V-F Active Learning and Cooperative Learning

Active learning aims to achieve improved accuracy with fewer training samples by selecting the data from which it learns. Cleverly picking training samples rather than selecting them randomly yields better predictive models with less human labelling effort [231]. An active learner selects samples from a large pool of unlabelled data and queries an oracle (e.g., a human annotator) for their labels. In speech processing, accurate labelling of speech utterances is extremely important and time-consuming, while unlabelled data is abundantly available. In this situation, active learning can help by allowing the model to select the samples from which it learns, leading to better performance with less training data. Some studies (e.g., [232, 233]) utilised classical ML-based active learning for ASR with the aim of minimising the effort required for transcribing and labelling data. However, it has been shown in [234, 235] that utilising deep models for active learning in speech processing can improve performance and significantly reduce the amount of labelled data required.
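As a minimal sketch of the pool-based setup described above (a hypothetical 1-D toy task, not from the cited studies), the following uncertainty-sampling loop trains a logistic regression and repeatedly queries the oracle for the pool sample the current model is least certain about:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabelled pool of 1-D "features"; only the oracle knows the true label (x > 0).
pool = rng.uniform(-3, 3, size=500)
oracle = (pool > 0).astype(float)

def fit(x, y, steps=500, lr=0.5):
    """Logistic regression on 1-D inputs via plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(w * x + b)))
        w -= lr * np.mean((p - y) * x)
        b -= lr * np.mean(p - y)
    return w, b

# Seed with the two extreme points, then for 10 rounds query the most
# uncertain sample (predicted probability closest to 0.5).
labelled = [int(np.argmin(pool)), int(np.argmax(pool))]
for _ in range(10):
    w, b = fit(pool[labelled], oracle[labelled])
    p = 1 / (1 + np.exp(-(w * pool + b)))
    uncertainty = -np.abs(p - 0.5)        # largest when p is near 0.5
    uncertainty[labelled] = -np.inf       # never re-query a labelled sample
    labelled.append(int(np.argmax(uncertainty)))

w, b = fit(pool[labelled], oracle[labelled])
acc = float(np.mean(((w * pool + b) > 0) == (oracle > 0.5)))
print(f"accuracy with only {len(labelled)} labels: {acc:.2f}")
```

Because the queried points cluster around the decision boundary, a handful of labels suffices to recover an accurate classifier, which is the effort saving active learning promises.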

Cooperative learning [235, 236] combines active and semi-supervised learning to best exploit the available data. It is an efficient way of sharing the labelling work between human and machine, which reduces the time and cost of human annotation [237]. In cooperative learning, predicted samples with insufficient confidence are passed to human annotators, while those with high confidence are labelled by the machine. Models trained via cooperative learning perform better than those trained via active or semi-supervised learning alone [235]. In speech processing, a few studies have utilised ML-based cooperative learning and shown its potential to significantly reduce data annotation effort. For instance, in [238], the authors applied cooperative learning to speed up the annotation of large multi-modal corpora. Similarly, the model proposed in [235] achieved the same performance with 75% fewer labelled instances compared to a model trained on the whole training set. These findings show the potential of cooperative learning in speech processing; however, DL-based representation learning methods still need to be investigated in this setting.
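The human/machine split at the heart of cooperative learning can be sketched as a simple confidence-based router. The posteriors and the 0.9 cut-off below are hypothetical placeholders; in practice the threshold would be tuned on a development set:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated model posteriors for 1,000 unlabelled utterances (binary task).
posterior = rng.uniform(0, 1, size=1000)
confidence = np.maximum(posterior, 1 - posterior)  # confidence of predicted class

THRESHOLD = 0.9  # hypothetical cut-off, tuned on a development set in practice

machine_labelled = confidence >= THRESHOLD  # accepted as machine labels
human_queue = ~machine_labelled             # routed to the human annotator

saved = 100 * machine_labelled.mean()
print(f"machine: {machine_labelled.sum()}, human: {human_queue.sum()} "
      f"({saved:.0f}% of annotation effort saved)")
```

The machine-labelled portion can then feed a semi-supervised learner, while human effort concentrates on the genuinely ambiguous samples.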

V-G Summarising the Findings

A summary of various representation learning techniques is presented in Table V. We segregated the studies based on the learning techniques used to train the representation learning models. Studies on supervised learning methods typically use models to learn discriminative and noise-robust representations. Supervised training of models such as CNNs, LSTM/GRU-RNNs, and CNN-LSTM/GRU-RNNs is widely exploited for learning representations from raw speech.

Unsupervised learning aims to discover patterns in data without using labels. We covered unsupervised representation learning for three speech applications. Autoencoding networks are widely used for unsupervised feature learning from speech. Most importantly, DAEs are very popular due to their denoising abilities: they can learn high-level representations from speech that are robust to noise corruption. Some studies also exploited AE- and RBM-based architectures for unsupervised feature learning due to their non-linear dimensionality reduction and long-range feature extraction abilities. Recently, VAEs have become very popular for learning speech representations due to their generative nature and distribution learning abilities. They can learn salient and robust features from speech that are essential for speech applications including ASR, SR, and SER.

Semi-supervised representation learning models are widely used in SER because its speech corpora are smaller than those for ASR and SR. These studies try to exploit additional data to improve the performance of SER. Popular models include AE-based architectures [171, 165] and other discriminative architectures [239, 167]. In ASR, semi-supervised learning is mostly exploited for learning noise-robust representations [240, 241] and for feature extraction [172, 173, 242].

Transfer learning methods, especially domain adaptation and MTL, are very popular in ASR, SR, and SER. Domain adaptation methods in these applications are mainly used to learn representations that are robust against noise as well as speaker, language, and corpus differences. In all three speech applications, adversarially learnt representations are found to better resolve domain mismatch. MTL methods are most popular in SER, where researchers utilise additional information available in speech (e.g., speaker identity or gender) to learn more generalised representations that help to improve performance.

Learning Type | Application | Aim | Models
Supervised | ASR | To learn discriminative and robust representations | DBNs ([243, 244, 140, 245, 246, 247, 248]), DNNs ([124, 249, 250, 251, 252]), CNNs ([253, 40, 254]), GANs ([255])
Supervised | SR | To learn discriminative and robust representations | DBNs ([256, 137, 257, 246, 258, 259]), DNNs ([260, 261, 262, 263, 264]), CNNs ([63, 265, 266, 267, 268, 269]), LSTM ([270, 271, 272, 273]), CNN-RNNs ([274]), GANs ([275])
Supervised | SER | To learn discriminative and robust representations | DNNs ([276, 277, 278]), RNNs ([279, 280]), CNNs ([281, 282, 283, 284]), CNN-RNNs ([285, 286, 287])
Supervised | ASR | To learn representations from raw speech | CNNs ([107, 108, 288, 289, 290, 291]), CNN-LSTM ([144, 292, 293, 294])
Supervised | SR | To learn representations from raw speech | CNNs ([295, 296, 297]), CNN-LSTM ([298])
Supervised | SER | To learn representations from raw speech | CNN-LSTM ([41, 142, 106]), DNN-LSTM ([143, 299])
Unsupervised | ASR | To learn speech features and noise-robust representations | DBNs ([136]), DNNs ([300]), CNNs ([301]), LSTM ([302]), AEs ([303]), VAEs ([183]), DAEs ([147, 304, 148, 305, 149])
Unsupervised | SR | To learn speech features and noise-robust representations | DBNs ([136]), DAEs ([306]), VAEs ([307, 46, 308, 146]), AL ([161]), GANs ([309])
Unsupervised | SER | To learn speech features and noise-robust representations | AEs ([152, 153, 310, 154]), DAEs ([155, 156]), VAEs ([79, 311]), AAEs ([160]), GANs ([312])
Unsupervised | ASR | To learn features from raw speech | RBMs ([150]), VAEs ([135]), GANs ([159])
Semi-supervised | ASR | To learn speech feature representations in a semi-supervised way | DNNs ([172, 173]), AEs ([174, 313]), GANs ([314])
Semi-supervised | SR | To learn speech feature representations in a semi-supervised way | DNNs ([175]), GANs ([176])
Semi-supervised | SER | To learn speech feature representations in a semi-supervised way | DNNs ([239]), CNNs ([167]), AEs ([171]), AAEs ([165])
Transfer (domain adaptation) | ASR | To learn representations that minimise the acoustic mismatch between training and testing conditions | DNNs ([315, 316]), AEs ([182]), VAEs ([46])
Transfer (domain adaptation) | SR | To learn representations that minimise the acoustic mismatch between training and testing conditions | AL ([197]), DAEs ([196]), GANs ([198])
Transfer (domain adaptation) | SER | To learn representations that minimise the acoustic mismatch between training and testing conditions | DBNs ([317]), CNNs ([318]), AL ([319, 320, 201]), AEs ([321, 200, 199, 118, 322])
Transfer (multi-task) | ASR | To learn common representations using multi-objective training | DNNs ([323, 324]), RNNs ([325, 326, 327]), AL ([190])
Transfer (multi-task) | SR | To learn common representations using multi-objective training | DNNs ([328, 329]), CNNs ([330]), RNNs ([331])
Transfer (multi-task) | SER | To learn common representations using multi-objective training | DBNs ([332]), DNNs ([333, 214, 334, 216]), CNN-LSTM ([335]), LSTM ([336, 337, 217, 213]), AL ([338]), GANs ([157])
TABLE V: Review of representation learning techniques used in different studies.

VI Challenges for Representation Learning

In this section, we discuss the challenges faced by representation learning. The summary of these challenges is presented in Figure 4.

Fig. 4: Challenges of representation learning.

VI-A Challenge of Training Deep Architectures

Theoretical and empirical evidence shows that deep models usually outperform classical machine learning techniques. It is also empirically validated that deep learning models require much more data to learn certain attributes efficiently [339]. For instance, deepening the network trained on the 1000-class ImageNet dataset increased top-5 accuracy from 84% [104] to 95% [340]. However, training deep learning models is not straightforward; it becomes considerably more difficult to optimise a deeper network [341, 342]. For deeper models, the number of network parameters becomes very large and tuning the various hyper-parameters is also very difficult. Thanks to the availability of modern graphics processing units (GPUs) and recent advancements in optimisation [343] and training strategies [344, 345], the training of DNNs has been considerably accelerated; however, it remains an open research problem.

Training representation learning models is an even trickier task. Learning high-level abstractions means more non-linearity, and learning representations associated with input manifolds becomes even more complex when the model needs to unfold and distort complicated input manifolds. Learning representations that involve disentangling and unfolding complex manifolds requires more intense and difficult training [1]. Natural speech has very complex manifolds [346] and inherently contains information about the message, gender, age, health status, personality, friendliness, mood, and emotion. All of this information is entangled together [347], and disentangling these attributes in some latent space is a very difficult task that requires extensive training. Most importantly, training unsupervised representation learning models is much more difficult than training supervised ones. As highlighted in [1], in supervised learning there is a clear objective to optimise: for instance, classifiers are trained to learn representations or features that minimise the misclassification error. Representation learning models do not have such training objectives in the way classification or regression problems do.

As outlined, GANs are a novel approach to generative modelling that aims to learn the distribution of real data points. In recent years, they have been widely utilised for representation learning in different fields, including speech. However, they have also proved difficult to train and suffer from different failure modes, mainly vanishing gradients, convergence problems, and mode collapse. Different remedies have been proposed to tackle these issues. For instance, the modified minimax loss [11] can help to deal with vanishing gradients, the Wasserstein loss [348] and training of ensembles alleviate mode collapse, and adding noise to the discriminator inputs [349] or penalising discriminator weights [350] act as regularisation to improve a GAN's convergence. These are early attempts to solve these issues; there is still considerable room to improve GAN training.
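The vanishing-gradient failure mode and the modified minimax remedy [11] can be seen with a two-line derivative calculation. Writing the discriminator output as D(G(z)) = σ(l) for logit l, the saturating generator loss log(1 − σ(l)) has gradient −σ(l), which vanishes exactly when the discriminator confidently rejects fakes, whereas the non-saturating loss −log σ(l) has gradient σ(l) − 1 ≈ −1 in that regime:

```python
import math

def sigmoid(l):
    return 1 / (1 + math.exp(-l))

# Gradients of the two generator losses w.r.t. the discriminator logit l:
#   saturating:      d/dl [ log(1 - sigmoid(l)) ] = -sigmoid(l)
#   non-saturating:  d/dl [ -log(sigmoid(l))    ] = sigmoid(l) - 1
for l in (-6.0, -2.0, 0.0):
    d = sigmoid(l)
    print(f"D(G(z))={d:.4f}  saturating={-d:+.4f}  non-saturating={d - 1:+.4f}")
```

At D(G(z)) ≈ 0.0025 the saturating gradient is essentially zero while the non-saturating one stays near −1, which is why the modified loss keeps the generator learning early in training.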

VI-B Performance Issues of Domain Invariant Features

To achieve generalisation in DL models, we need a large amount of data with similar training and testing examples. However, the performance of DL models drops significantly if test samples deviate from the distribution of the training data. Speech representations that are invariant to variabilities in speakers, language, etc., are very difficult to learn: representations learnt from one corpus often do not transfer well to another corpus with different recording conditions. This issue is common to all three speech applications covered in this paper. In the past few years, researchers have achieved competitive performance by learning speaker-invariant representations [351, 210]. However, language-invariant representation is still very challenging. The main reason is that available speech corpora cover only a few languages in contrast to the number of spoken languages in the world. There are more than 5,000 spoken languages in the world, but only 389 languages account for 94% of the world's population. We do not have speech corpora even for these 389 languages to enable cross-language speech processing research. This variation, imbalance, diversity, and dynamics in speech and language corpora hinder the design of generalised representation learning algorithms.

VI-C Adversary on Representation Learning

DL has undoubtedly offered tremendous improvements in the performance of state-of-the-art speech representation learning systems. However, recent works on adversarial examples pose enormous challenges for robust representation learning from speech by showing the susceptibility of DNNs to adversarial examples with imperceptible perturbations [86]. Popular adversarial attacks include the fast gradient sign method (FGSM) [352], the Jacobian-based saliency map attack (JSMA) [353], and DeepFool [354]; they compute the perturbation noise based on the gradient of the targeted output. Such attacks have also been evaluated against speech-based systems. For instance, Carlini and Wagner [355] evaluated an iterative optimisation-based attack against DeepSpeech [356] (a state-of-the-art ASR model) with a 100% success rate, and other works have likewise proposed adversarial attacks against speech-based systems [357, 358]. The success of adversarial attacks against DL models suggests that the representations they learn are not robust [359]. Therefore, research is ongoing to tackle this challenge by exploring what DL models capture from the input data and how adversarial examples can be explained as a combination of previously learnt representations without any knowledge of the adversary.

VI-D Quality of Speech Data

Representation learning models aim to identify potentially useful and ultimately understandable patterns. This demands not just more data, but more comprehensive and diverse data: for learning a good representation, data must be correct, properly labelled, and unbiased. The quality of speech data can be poor for various reasons. For example, background noise and music can corrupt the speech data, and the noise of microphones or recording devices can likewise pollute the speech signal. Although studies use 'noise injection' techniques to avoid overfitting, this works only for moderately high signal-to-noise ratios [360]. This has been an active research topic, and in the past few years different DL models have been proposed that can learn representations from noisy data. For instance, DAEs [119] can learn a representation from noisy data, the imputation AE [361] can learn a representation from incomplete data, and the non-local AE [362] can learn reliable features from corrupted data. Such techniques are also very popular in the speech community for noise-invariant representation learning, as highlighted in Table V. However, there is still a need for DL models that can cope with poor-quality data, not only for speech but also for other domains.
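The 'noise injection' augmentation mentioned above amounts to mixing noise into the clean signal at a chosen signal-to-noise ratio. A minimal sketch, with a synthetic tone standing in for real speech:

```python
import numpy as np

def add_noise_at_snr(clean, snr_db, rng):
    """Mix white Gaussian noise into `clean` at the requested SNR (in dB)."""
    noise = rng.normal(size=clean.shape)
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_signal / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)  # 1 s of a 440 Hz tone at 16 kHz

noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)
measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(f"measured SNR: {measured:.2f} dB")
```

Sweeping `snr_db` over a training set is the standard way to expose a model to the moderate-SNR conditions the cited studies target.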

VII Recent Advancements and Future Trends

VII-A Open Source Datasets and Toolkits

A large number of speech databases are available for speech analysis research. However, several popular benchmark datasets, including TIMIT [54], WSJ [56], and AMI [57], are not freely available; they are usually purchased from commercial organisations such as LDC, ELRA, and Speech Ocean. The licence fees of these datasets are affordable for most research institutes; however, they are expensive (e.g., the WSJ corpus licence costs 2,500 USD) for young researchers who want to start research on speech, particularly in developing countries. Recently, a free-data movement has started in the speech community, and various good-quality datasets have been made publicly available to spark more research in this field. VoxForge and OpenSLR are two popular platforms that host freely available speech and speaker recognition datasets. Most SER corpora are developed by research institutes and are freely available for research purposes.

Feature Extraction Tools
Tool | Description
Librosa [363] | A Python-based toolkit for music and audio analysis
| A Python library that facilitates a wide range of feature extraction and classification of audio signals, supervised and unsupervised segmentation, and content visualisation
openSMILE [365] | Enables extraction of a large number of audio features in real time; written in C++

Speech Recognition Toolkits
Toolkit | Languages | Trained models
CMU Sphinx [366] | Java, C, Python, and others | English plus 10 other languages
Kaldi [367] | C++, Python |
Julius [368] | C, Python |
ESPnet [369] | | English, Japanese
HTK [370] | C, Python |

Speaker Identification
ALIZE [371] | |

Speech Emotion Recognition
OpenEAR [372] | | English, German

TABLE VI: Some popular tools for speech feature extraction and model implementations.

Another important contribution by the speech community is the development of open-source toolkits for speech processing and analysis. These tools help researchers not only with feature extraction, but also with the development of models. Details of such tools are presented in Table VI. It can be noted that ASR—as the largest field of activity—has more open-source toolkits than SR and SER. The development of such toolkits and speech corpora provides great benefits to the speech research community and will continue to be needed to speed up research progress on speech.

VII-B Computational Advancements

In contrast to classical ML models, DL involves a significantly larger number of parameters and huge amounts of matrix multiplications along with many other operations. Traditional central processing units (CPUs) do not support such processing efficiently; therefore, advanced parallel computing is necessary for the development of deep networks. This is achieved by utilising graphics processing units (GPUs), which contain thousands of cores that can perform exceptionally fast matrix multiplications. In contrast to CPUs and GPUs, the Tensor Processing Units (TPUs) developed by Google offer 15-30 times higher processing speed and 30-80 times higher performance-per-watt [373]. A recent paper [374] on quantum supremacy using a programmable superconducting processor reports remarkable results, performing a computation in a Hilbert space of dimension 2^53 (about 9 × 10^15), far beyond the reach of the fastest supercomputers available today; it was the first such computation performed on a quantum processor. Computational power will thus continue to grow at a double-exponential rate, which may disrupt the area of representation learning from vast amounts of unlabelled data by unlocking the new computational capabilities of quantum processors.

VII-C Processing Raw Speech

In the past few years, the trend of using hand-engineered acoustic features has been progressively changing, and DL is gaining popularity as a viable alternative for learning from raw speech directly. This removes the feature extraction module from the pipeline of ASR, SR, and SER systems. Recently, important progress has also been made by Donahue et al. [159] in audio generation. They proposed WaveGAN for the unsupervised synthesis of raw-waveform audio and showed that their model can learn to produce intelligible words when trained on a small-vocabulary speech dataset, and can also synthesise music audio and bird vocalisations. Other recent works [375, 376] have also explored audio synthesis using GANs; such work is at an early stage and will likely open new prospects for future research, as transpired with the use of GANs in the vision domain (e.g., with DeepFakes [377]).

VII-D The Rise of Adversarial Training

The idea of adversarial training was proposed in 2014 [11] and has led to widespread research in various ML domains, including speech representation learning. Speech-based systems—principally ASR, SR, and SER systems—need to be robust to acoustic variabilities arising from environment, speaker, and recording conditions. This is crucial for industrial applications of these systems.

GANs are being used as a viable tool for robust speech representation learning [255] and also for speech enhancement [275] to tackle noise issues. A popular GAN variant, the cycle-consistent Generative Adversarial Network (CycleGAN) [378], is being used for domain adaptation in low-resource scenarios, where only a limited amount of target data is available for adaptation [379]. These CycleGAN results on speech are very promising for domain adaptation and will also encourage the design of systems that learn domain-invariant representations, especially for zero-resource languages, to enable cross-culture speech-based applications.

Another interesting use of GANs is learning from synthetic data. Researchers have also succeeded in synthesising speech signals with GANs [159]. Synthetic data can be utilised in applications where large labelled datasets are not available, as is the case in SER. Learning representations from synthetic data can help improve system performance, and researchers have explored the use of synthetic data for SER [160, 380]. This shows the feasibility of learning from synthetic data and will lead to interesting research on problems in the speech domain where data scarcity is a major issue.

VII-E Representation Learning with Interaction

A good representation disentangles the underlying explanatory factors of variation. However, it is an open research question what kind of training framework can learn disentangled representations from input data. Most research on representation learning uses static settings without interaction with the environment. Reinforcement learning (RL) facilitates the idea of learning while interacting with the environment. If RL is used to disentangle factors of variation by interacting with the environment, a good representation can be learnt, leading to faster convergence in contrast to blindly attempting to solve given problems. Such an idea has recently been validated by Thomas et al. [381], who used RL to disentangle independently controllable factors of variation via a specific objective function. The authors empirically showed that the agent can disentangle these aspects of the environment without any extrinsic reward. This is an important finding that will act as a key to further research in this direction.

VII-F Privacy Preserving Representations

When people use speech-based services such as voice authentication or speech recognition, they grant complete access to their recordings. These services can extract information such as the user's gender, ethnicity, and emotional state, which can be used for undesired purposes, and various other privacy-related issues arise while using speech-based services [382]. It is desirable that speech processing applications provide suitable safeguards against unauthorised and undisclosed eavesdropping and violations of privacy. Privacy-preserving representation learning is a relatively unexplored research topic; recently, researchers have started to utilise privacy-preserving representation learning models to protect speaker identity [383] and gender identity [384]. To preserve users' privacy, federated learning [385] is another alternative setting, in which a shared global model is trained across multiple participating computing devices under the coordination of a central server while the training data remain decentralised.
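Federated averaging, the canonical algorithm behind federated learning [385], can be sketched for a shared linear model as follows. The clients, data, and hyper-parameters are synthetic stand-ins; real deployments add client sampling, secure aggregation, and communication constraints:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """A few epochs of full-batch gradient descent on the client's own data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w -= lr * grad
    return w

# Three clients with private datasets of different sizes, all generated from
# the same underlying linear model (true_w) plus a little noise.
true_w = np.array([2.0, -1.0])
clients = []
for n in (50, 100, 150):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=n)))

# Server loop: broadcast the global model, collect local updates, and average
# them weighted by client dataset size. Raw data never leaves a client.
global_w = np.zeros(2)
for _ in range(50):
    updates = [local_update(global_w, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    global_w = np.average(updates, axis=0, weights=sizes)

print(global_w)  # converges towards true_w
```

Only weight vectors cross the network; each client's data (the stand-in for private recordings) never leaves the device.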

VIII Conclusions

In this article, we have provided a comprehensive review of representation learning for speech signals using deep learning approaches in three principal speech processing areas: automatic speech recognition (ASR), speaker recognition (SR), and speech emotion recognition (SER). In all three areas, the use of representation learning is very promising, and there is ongoing research exploring different models and methods to disentangle speech attributes suitable for these tasks. The literature review performed in this work shows that LSTM/GRU-RNNs in combination with CNNs are suitable for capturing speech attributes; most studies have used LSTM models in a supervised way. In unsupervised representation learning, DAEs and VAEs are the most widely deployed architectures in the speech community, with GAN-based models also attaining prominence for speech enhancement and feature learning. Apart from providing a detailed review, we have also highlighted the challenges faced by researchers working with representation learning techniques and avenues for future work. It is hoped that this article will become a definitive guide for researchers and practitioners interested in working either on speech signals or on deep representation learning in general. We are curious whether, in the longer run, representation learning will become the standard paradigm in speech processing. If so, we are currently witnessing a paradigm change, moving away from signal processing and expert-crafted features into a highly data-driven era—with all its advantages, challenges, and risks.


  • [1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
  • [2] C. Herff and T. Schultz, “Automatic speech recognition from neural signals: a focused review,” Frontiers in neuroscience, vol. 10, p. 429, 2016.
  • [3] D. T. Tran, “Fuzzy approaches to speech and speaker recognition,” Ph.D. dissertation, university of Canberra, 2000.
  • [4] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, “Deep learning applications and challenges in big data analytics,” Journal of Big Data, vol. 2, no. 1, p. 1, 2015.
  • [5] X. Huang, J. Baker, and R. Reddy, “A historical perspective of speech recognition,” Commun. ACM, vol. 57, no. 1, pp. 94–103, 2014.
  • [6] G. Zhong, L.-N. Wang, X. Ling, and J. Dong, “An overview on data representation learning: From traditional feature learning to recent deep learning,” The Journal of Finance and Data Science, vol. 2, no. 4, pp. 265–278, 2016.
  • [7] Z. Zhang, J. Geiger, J. Pohjalainen, A. E.-D. Mousa, W. Jin, and B. Schuller, “Deep learning for environmentally robust speech recognition: An overview of recent developments,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 9, no. 5, p. 49, 2018.
  • [8] M. Swain, A. Routray, and P. Kabisatpathy, “Databases, features and classifiers for speech emotion recognition: a review,” International Journal of Speech Technology, vol. 21, no. 1, pp. 93–120, 2018.
  • [9] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, “Speech recognition using deep neural networks: A systematic review,” IEEE Access, vol. 7, pp. 19 143–19 165, 2019.
  • [10] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [12] J. Gómez-García, L. Moro-Velázquez, and J. I. Godino-Llorente, “On the design of automatic voice condition analysis systems. part ii: Review of speaker recognition techniques and study on the effects of different variability factors,” Biomedical Signal Processing and Control, vol. 48, pp. 128–143, 2019.
  • [13] G. Zhong, X. Ling, and L.-N. Wang, “From shallow feature learning to deep learning: Benefits from the width and depth of deep architectures,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 9, no. 1, p. e1255, 2019.
  • [14] K. Pearson, “Liii. on lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
  • [15] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of eugenics, vol. 7, no. 2, pp. 179–188, 1936.
  • [16] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural computation, vol. 16, no. 12, pp. 2639–2664, 2004.
  • [17] I. Borg and P. Groenen, “Modern multidimensional scaling: Theory and applications,” Journal of Educational Measurement, vol. 40, no. 3, pp. 277–280, 2003.
  • [18] A. Hyvärinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural networks, vol. 13, no. 4-5, pp. 411–430, 2000.
  • [19] B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural computation, vol. 10, no. 5, pp. 1299–1319, 1998.
  • [20] G. Baudat and F. Anouar, “Generalized discriminant analysis using a kernel approach,” Neural computation, vol. 12, no. 10, pp. 2385–2404, 2000.
  • [21] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, p. 788, 1999.
  • [22] F. Lee, R. Scherer, R. Leeb, A. Schlögl, H. Bischof, and G. Pfurtscheller, Feature mapping using PCA, locally linear embedding and isometric feature mapping for EEG-based brain computer interface.   Citeseer, 2004.
  • [23] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” science, vol. 290, no. 5500, pp. 2323–2326, 2000.
  • [24] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” science, vol. 290, no. 5500, pp. 2319–2323, 2000.
  • [25] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
  • [26] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [27] A. Kocsor and L. Tóth, “Kernel-based feature extraction with a speech technology application,” IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2250–2263, 2004.
  • [28] T. Takiguchi and Y. Ariki, “Pca-based speech enhancement for distorted speech recognition.” Journal of multimedia, vol. 2, no. 5, 2007.
  • [29] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [30] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Advances in neural information processing systems, 2007, pp. 153–160.
  • [31] C. Poultney, S. Chopra, Y. L. Cun et al., “Efficient learning of sparse representations with an energy-based model,” in Advances in neural information processing systems, 2007, pp. 1137–1144.
  • [32] M. Gales, S. Young et al., “The application of hidden markov models in speech recognition,” Foundations and Trends® in Signal Processing, vol. 1, no. 3, pp. 195–304, 2008.
  • [33] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE transactions on acoustics, speech, and signal processing, vol. 37, no. 3, pp. 328–339, 1989.
  • [34] H. Purwins, B. Li, T. Virtanen, J. Schlüter, S.-Y. Chang, and T. Sainath, “Deep learning for audio signal processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 206–219, 2019.
  • [35] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
  • [36] M. Wöllmer, Y. Sun, F. Eyben, and B. Schuller, “Long short-term memory networks for noise robust speech recognition,” in Proc. INTERSPEECH 2010, Makuhari, Japan, 2010, pp. 2966–2969.
  • [37] M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, and R. Cowie, “Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies,” in Proc. 9th Interspeech 2008 incorp. 12th Australasian Int. Conf. on Speech Science and Technology SST 2008, Brisbane, Australia, 2008, pp. 597–600.
  • [38] S. Latif, M. Usman, R. Rana, and J. Qadir, “Phonocardiographic sensing using deep learning for abnormal heartbeat detection,” IEEE Sensors Journal, vol. 18, no. 22, pp. 9393–9400, 2018.
  • [39] A. Qayyum, S. Latif, and J. Qadir, “Quran reciter identification: A deep learning approach,” in 2018 7th International Conference on Computer and Communication Engineering (ICCCE).   IEEE, 2018, pp. 492–497.
  • [40] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 4580–4584.
  • [41] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2016, pp. 5200–5204.
  • [42] M. Längkvist, L. Karlsson, and A. Loutfi, “A review of unsupervised feature learning and deep learning for time-series modeling,” Pattern Recognition Letters, vol. 42, pp. 11–24, 2014.
  • [43] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016.
  • [44] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
  • [45] B. Bollepalli, L. Juvela, and P. Alku, “Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis,” arXiv preprint arXiv:1903.05955, 2019.
  • [46] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Advances in neural information processing systems, 2017, pp. 1878–1889.
  • [47] S. Furui, “Speaker-independent isolated word recognition based on emphasized spectral dynamics,” in ICASSP’86. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 11.   IEEE, 1986, pp. 1991–1994.
  • [48] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE transactions on acoustics, speech, and signal processing, vol. 28, no. 4, pp. 357–366, 1980.
  • [49] H. Purwins, B. Blankertz, and K. Obermayer, “A new method for tracking modulations in tonal music in audio data format,” in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 6.   IEEE, 2000, pp. 270–275.
  • [50] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2015.
  • [51] M. Neumann and N. T. Vu, “Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech,” arXiv preprint arXiv:1706.00612, 2017.
  • [52] N. Jaitly and G. Hinton, “Learning a better representation of speech soundwaves using restricted boltzmann machines,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2011, pp. 5884–5887.
  • [53] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting unreasonable effectiveness of data in deep learning era,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 843–852.
  • [54] W. M. Fisher, G. R. Doddington, and K. M. Goudie-Marshall, “The DARPA speech recognition research database: Specifications and status,” in Proceedings of DARPA Workshop on Speech Recognition, 1986, pp. 93–99.
  • [55] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “Switchboard: Telephone speech corpus for research and development,” in [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1.   IEEE, 1992, pp. 517–520.
  • [56] D. B. Paul and J. M. Baker, “The design for the wall street journal-based csr corpus,” in Proceedings of the workshop on Speech and Natural Language.   Association for Computational Linguistics, 1992, pp. 357–362.
  • [57] T. Hain, L. Burget, J. Dines, G. Garau, V. Wan, M. Karafiát, J. Vepa, and M. Lincoln, “The ami system for the transcription of speech in meetings,” in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, vol. 4.   IEEE, 2007, pp. IV–357.
  • [58] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of german emotional speech,” in Ninth European Conference on Speech Communication and Technology, 2005.
  • [59] B. Schuller, S. Steidl, and A. Batliner, “The interspeech 2009 emotion challenge,” in Tenth Annual Conference of the International Speech Communication Association, 2009.
  • [60] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the recola multimodal corpus of remote collaborative and affective interactions,” in 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG).   IEEE, 2013, pp. 1–8.
  • [61] T. Bänziger, M. Mortillaro, and K. R. Scherer, “Introducing the geneva multimodal expression corpus for experimental research on emotion perception.” Emotion, vol. 12, no. 5, p. 1161, 2012.
  • [62] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 5206–5210.
  • [63] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
  • [64] A. Rousseau, P. Deléglise, and Y. Esteve, “Ted-lium: an automatic speech recognition dedicated corpus.” in LREC, 2012, pp. 125–129.
  • [65] D. Wang and X. Zhang, “Thchs-30: A free chinese speech corpus,” arXiv preprint arXiv:1512.01882, 2015.
  • [66] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA).   IEEE, 2017, pp. 1–5.
  • [67] B. Milde and A. Köhn, “Open source automatic speech recognition for german,” in Speech Communication; 13th ITG-Symposium.   VDE, 2018, pp. 1–5.
  • [68] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, “The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 5–17, 2012.
  • [69] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, no. 4, p. 335, 2008.
  • [70] C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, and E. M. Provost, “Msp-improv: An acted corpus of dyadic interactions to study emotion perception,” IEEE Transactions on Affective Computing, no. 1, pp. 67–80, 2017.
  • [71] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 4, p. 101962, 2004.
  • [72] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge,” Speech Communication, vol. 53, no. 9/10, pp. 1062–1087, November/December 2011.
  • [73] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press, 2016.
  • [74] Y. Bengio et al., “Learning deep architectures for ai,” Foundations and trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
  • [75] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” arXiv preprint arXiv:1902.06162, 2019.
  • [76] H. Liang, X. Sun, Y. Sun, and Y. Gao, “Text feature extraction based on deep learning: a review,” EURASIP journal on wireless communications and networking, vol. 2017, no. 1, pp. 1–12, 2017.
  • [77] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, “Audio-visual speech recognition using deep learning,” Applied Intelligence, vol. 42, no. 4, pp. 722–737, 2015.
  • [78] M. Usman, S. Latif, and J. Qadir, “Using deep autoencoders for facial expression recognition,” in 2017 13th International Conference on Emerging Technologies (ICET).   IEEE, 2017, pp. 1–6.
  • [79] S. Latif, R. Rana, J. Qadir, and J. Epps, “Variational autoencoders for learning latent representations of speech emotion: A preliminary study,” arXiv preprint arXiv:1712.08708, 2017.
  • [80] B. Mitra, N. Craswell et al., “An introduction to neural information retrieval,” Foundations and Trends® in Information Retrieval, vol. 13, no. 1, pp. 1–126, 2018.
  • [81] B. Mitra and N. Craswell, “Neural models for information retrieval,” arXiv preprint arXiv:1705.01509, 2017.
  • [82] J. Kim, J. Urbano, C. C. Liem, and A. Hanjalic, “One deep music representation to rule them all? a comparative analysis of different representation learning strategies,” Neural Computing and Applications, pp. 1–27, 2018.
  • [83] A. Sordoni, “Learning representations for information retrieval,” Ph.D. dissertation, Université de Montréal, 2016.
  • [84] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial machine learning at scale,” arXiv preprint arXiv:1611.01236, 2016.
  • [85] X. Liu, M. Cheng, H. Zhang, and C.-J. Hsieh, “Towards robust neural networks via random self-ensemble,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 369–385.
  • [86] S. Latif, R. Rana, and J. Qadir, “Adversarial machine learning and speech emotion recognition: Utilizing generative adversarial networks for robustness,” arXiv preprint arXiv:1811.11402, 2018.
  • [87] S. Liu, G. Keren, and B. W. Schuller, “N-HANS: Introducing the Augsburg Neuro-Holistic Audio-eNhancement System,” arXiv preprint arXiv:1911.07062, 2019.
  • [88] R. Xu and D. C. Wunsch, “Survey of clustering algorithms,” IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.
  • [89] E. Min, X. Guo, Q. Liu, G. Zhang, J. Cui, and J. Long, “A survey of clustering with deep learning: From the perspective of network architecture,” IEEE Access, vol. 6, pp. 39 501–39 514, 2018.
  • [90] J. J. DiCarlo and D. D. Cox, “Untangling invariant object recognition,” Trends in cognitive sciences, vol. 11, no. 8, pp. 333–341, 2007.
  • [91] I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner, “Towards a definition of disentangled representations,” arXiv preprint arXiv:1812.02230, 2018.
  • [92] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” arXiv preprint arXiv:1612.00410, 2016.
  • [93] G. E. Hinton and S. T. Roweis, “Stochastic neighbor embedding,” in Advances in neural information processing systems, 2003, pp. 857–864.
  • [94] A. Errity and J. McKenna, “An investigation of manifold learning for speech analysis,” in Ninth International Conference on Spoken Language Processing, 2006.
  • [95] L. Cayton, “Algorithms for manifold learning,” Univ. of California at San Diego, Tech. Rep., 2005.
  • [96] E. Golchin and K. Maghooli, “Overview of manifold learning and its application in medical data set,” International journal of biomedical engineering and science (IJBES), vol. 1, no. 2, pp. 23–33, 2014.
  • [97] Y. Ma and Y. Fu, Manifold learning theory and applications.   CRC press, 2011.
  • [98] P. Dayan and L. F. Abbott, “Theoretical neuroscience: Computational and mathematical modeling of neural systems,” 2001.
  • [99] A. J. Simpson, “Abstract learning via demodulation in a deep neural network,” arXiv preprint arXiv:1502.04042, 2015.
  • [100] A. Cutler, “The abstract representations in speech processing,” The Quarterly Journal of Experimental Psychology, vol. 61, no. 11, pp. 1601–1619, 2008.
  • [101] F. Deng, J. Ren, and F. Chen, “Abstraction learning,” arXiv preprint arXiv:1809.03956, 2018.
  • [102] L. Deng, “A tutorial survey of architectures, algorithms, and applications for deep learning,” APSIPA Transactions on Signal and Information Processing, vol. 3, 2014.
  • [103] D. Svozil, V. Kvasnicka, and J. Pospichal, “Introduction to multi-layer feed-forward neural networks,” Chemometrics and intelligent laboratory systems, vol. 39, no. 1, pp. 43–62, 1997.
  • [104] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [105] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, 1995.
  • [106] S. Latif, R. Rana, S. Khalifa, R. Jurdak, and J. Epps, “Direct modelling of speech emotion from raw speech,” arXiv preprint arXiv:1904.03833, 2019.
  • [107] D. Palaz, R. Collobert et al., “Analysis of cnn-based speech recognition system using raw speech as input,” Idiap, Tech. Rep., 2015.
  • [108] D. Palaz, M. M. Doss, and R. Collobert, “Convolutional neural networks-based continuous speech recognition using raw speech signal,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 4295–4299.
  • [109] Z. Aldeneh and E. M. Provost, “Using regional saliency for speech emotion recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 2741–2745.
  • [110] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
  • [111] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [112] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [113] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [114] T. N. Sainath, R. Pang, D. Rybach, Y. He, R. Prabhavalkar, W. Li, M. Visontai, Q. Liang, T. Strohman, Y. Wu et al., “Two-pass end-to-end speech recognition,” arXiv preprint arXiv:1908.10992, 2019.
  • [115] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description length and helmholtz free energy,” in Advances in neural information processing systems, 1994, pp. 3–10.
  • [116] A. Ng et al., “Sparse autoencoder,” CS294A Lecture notes, vol. 72, no. 2011, pp. 1–19, 2011.
  • [117] A. Makhzani and B. Frey, “K-sparse autoencoders,” arXiv preprint arXiv:1312.5663, 2013.
  • [118] J. Deng, Z. Zhang, E. Marchi, and B. Schuller, “Sparse autoencoder-based feature transfer learning for speech emotion recognition,” in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction.   IEEE, 2013, pp. 511–516.
  • [119] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning.   ACM, 2008, pp. 1096–1103.
  • [120] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance during feature extraction,” in Proceedings of the 28th International Conference on International Conference on Machine Learning.   Omnipress, 2011, pp. 833–840.
  • [121] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
  • [122] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A learning algorithm for boltzmann machines,” Cognitive science, vol. 9, no. 1, pp. 147–169, 1985.
  • [123] Y. Bengio, E. Laufer, G. Alain, and J. Yosinski, “Deep generative stochastic networks trainable by backprop,” in International Conference on Machine Learning, 2014, pp. 226–234.
  • [124] D. Yu, M. L. Seltzer, J. Li, J.-T. Huang, and F. Seide, “Feature learning in deep neural networks-studies on speech recognition tasks,” arXiv preprint arXiv:1301.3605, 2013.
  • [125] X. Li and X. Wu, “Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on.   IEEE, 2015, pp. 4520–4524.
  • [126] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
  • [127] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.
  • [128] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in neural information processing systems, 2016, pp. 2172–2180.
  • [129] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-vae: Learning basic visual concepts with a constrained variational framework.” ICLR, vol. 2, no. 5, p. 6, 2017.
  • [130] S. Zhao, J. Song, and S. Ermon, “Infovae: Information maximizing variational autoencoders,” arXiv preprint arXiv:1706.02262, 2017.
  • [131] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville, “Pixelvae: A latent variable model for natural images,” arXiv preprint arXiv:1611.05013, 2016.
  • [132] M. Tschannen, O. Bachem, and M. Lucic, “Recent advances in autoencoder-based representation learning,” arXiv preprint arXiv:1812.05069, 2018.
  • [133] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with pixelcnn decoders,” in Advances in neural information processing systems, 2016, pp. 4790–4798.
  • [134] D. Rethage, J. Pons, and X. Serra, “A wavenet for speech denoising,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5069–5073.
  • [135] J. Chorowski, R. J. Weiss, S. Bengio, and A. v. d. Oord, “Unsupervised speech representation learning using wavenet autoencoders,” arXiv preprint arXiv:1901.08810, 2019.
  • [136] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Advances in neural information processing systems, 2009, pp. 1096–1104.
  • [137] H. Ali, S. N. Tran, E. Benetos, and A. S. d. Garcez, “Speaker recognition with hybrid features from a deep belief network,” Neural Computing and Applications, vol. 29, no. 6, pp. 13–19, 2018.
  • [138] S. Yaman, J. Pelecanos, and R. Sarikaya, “Bottleneck features for speaker recognition,” in Odyssey 2012-The Speaker and Language Recognition Workshop, 2012.
  • [139] Z. Cairong, Z. Xinran, Z. Cheng, and Z. Li, “A novel dbn feature fusion model for cross-corpus speech emotion recognition,” Journal of Electrical and Computer Engineering, vol. 2016, 2016.
  • [140] G. Dahl, A.-r. Mohamed, G. E. Hinton et al., “Phone recognition with the mean-covariance restricted boltzmann machine,” in Advances in neural information processing systems, 2010, pp. 469–477.
  • [141] H. Muckenhirn, M. M. Doss, and S. Marcel, “Towards directly modeling raw speech signal for speaker verification using cnns,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 4884–4888.
  • [142] P. Tzirakis, J. Zhang, and B. W. Schuller, “End-to-end speech emotion recognition using deep neural networks,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5089–5093.
  • [143] M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, “Emotion identification from raw speech signals using dnns.” in Interspeech, 2018, pp. 3097–3101.
  • [144] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform cldnns,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [145] M. Neumann and N. T. Vu, “Improving speech emotion recognition with unsupervised representation learning on unlabeled speech,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 7390–7394.
  • [146] W.-N. Hsu, Y. Zhang, R. J. Weiss, Y.-A. Chung, Y. Wang, Y. Wu, and J. Glass, “Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 5901–5905.
  • [147] X. Feng, Y. Zhang, and J. Glass, “Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition,” in 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2014, pp. 1759–1763.
  • [148] F. Weninger, S. Watanabe, Y. Tachioka, and B. Schuller, “Deep recurrent de-noising auto-encoder and blind de-reverberation for reverberated speech recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2014, pp. 4623–4627.
  • [149] M. Zhao, D. Wang, Z. Zhang, and X. Zhang, “Music removal by convolutional denoising autoencoder in speech recognition,” in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).   IEEE, 2015, pp. 338–341.
  • [150] H. B. Sailor and H. A. Patil, “Unsupervised deep auditory model using stack of convolutional rbms for speech recognition.” in INTERSPEECH, 2016, pp. 3379–3383.
  • [151] A.-r. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Nips workshop on deep learning for speech recognition and related applications, vol. 1, no. 9.   Vancouver, Canada, 2009, p. 39.
  • [152] S. Ghosh, E. Laksana, L.-P. Morency, and S. Scherer, “Representation learning for speech emotion recognition.” in Interspeech, 2016, pp. 3603–3607.
  • [153] ——, “Learning representations of affect from speech,” arXiv preprint arXiv:1511.04747, 2015.
  • [154] Z.-w. Huang, W.-t. Xue, and Q.-r. Mao, “Speech emotion recognition with unsupervised feature learning,” Frontiers of Information Technology & Electronic Engineering, vol. 16, no. 5, pp. 358–366, 2015.
  • [155] R. Xia, J. Deng, B. Schuller, and Y. Liu, “Modeling gender information for emotion recognition using denoising autoencoder,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2014, pp. 990–994.
  • [156] R. Xia and Y. Liu, “Using denoising autoencoder for emotion recognition.” in Interspeech, 2013, pp. 2886–2889.
  • [157] J. Chang and S. Scherer, “Learning representations of emotional speech with deep convolutional generative adversarial networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on.   IEEE, 2017, pp. 2746–2750.
  • [158] H. Yu, Z.-H. Tan, Z. Ma, and J. Guo, “Adversarial network bottleneck features for noise robust speaker verification,” arXiv preprint arXiv:1706.03397, 2017.
  • [159] C. Donahue, J. McAuley, and M. Puckette, “Synthesizing audio with generative adversarial networks,” arXiv preprint arXiv:1802.04208, 2018.
  • [160] S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, “Adversarial auto-encoders for speech based emotion recognition,” arXiv preprint arXiv:1806.02146, 2018.
  • [161] M. Ravanelli and Y. Bengio, “Learning speaker representations with mutual information,” arXiv preprint arXiv:1812.00271, 2018.
  • [162] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, “Transfer learning for improving speech emotion classification accuracy,” arXiv preprint arXiv:1801.06353, 2018.
  • [163] S. Latif, J. Qadir, and M. Bilal, “Unsupervised adversarial domain adaptation for cross-lingual speech emotion recognition,” arXiv preprint arXiv:1907.06083, 2019.
  • [164] S. Latif, A. Qayyum, M. Usman, and J. Qadir, “Cross lingual speech emotion recognition: Urdu vs. western languages,” in 2018 International Conference on Frontiers of Information Technology (FIT).   IEEE, 2018, pp. 88–93.
  • [165] R. Rana, S. Latif, S. Khalifa, and R. Jurdak, “Multi-task semi-supervised adversarial autoencoding for speech emotion,” arXiv preprint arXiv:1907.06078, 2019.
  • [166] X. J. Zhu, “Semi-supervised learning literature survey,” University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 2005.
  • [167] Z. Huang, M. Dong, Q. Mao, and Y. Zhan, “Speech emotion recognition using cnn,” in Proceedings of the 22Nd ACM International Conference on Multimedia, 2014.
  • [168] J. Huang, Y. Li, J. Tao, Z. Lian, M. Niu, and J. Yi, “Speech emotion recognition using semi-supervised learning with ladder networks,” in 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia).   IEEE, 2018, pp. 1–5.
  • [169] S. Parthasarathy and C. Busso, “Semi-supervised speech emotion recognition with ladder networks,” arXiv preprint arXiv:1905.02921, 2019.
  • [170] ——, “Ladder networks for emotion recognition: Using unsupervised auxiliary tasks to improve predictions of emotional attributes,” Proc. Interspeech 2018, pp. 3698–3702, 2018.
  • [171] J. Deng, X. Xu, Z. Zhang, S. Frühholz, and B. Schuller, “Semisupervised autoencoders for speech emotion recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 31–43, 2018.
  • [172] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural network features and semi-supervised training for low resource speech recognition,” in 2013 IEEE international conference on acoustics, speech and signal processing.   IEEE, 2013, pp. 6704–6708.
  • [173] J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audhkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny et al., “Multilingual representations for low resource speech recognition and keyword search,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).   IEEE, 2015, pp. 259–266.
  • [174] S. Karita, S. Watanabe, T. Iwata, A. Ogawa, and M. Delcroix, “Semi-supervised end-to-end speech recognition.” in Interspeech, 2018, pp. 2–6.
  • [175] Y. Tu, J. Du, Y. Xu, L. Dai, and C.-H. Lee, “Speech separation based on improved deep neural networks with dual outputs of speech features for both target and interfering speakers,” in The 9th International Symposium on Chinese Spoken Language Processing.   IEEE, 2014, pp. 250–254.
  • [176] M. Pal, M. Kumar, R. Peri, and S. Narayanan, “A study of semi-supervised speaker diarization system using GAN mixture model,” arXiv preprint arXiv:1910.11416, 2019.
  • [177] Y. Bengio, “Deep learning of representations for unsupervised and transfer learning,” in Proceedings of ICML workshop on unsupervised and transfer learning, 2012, pp. 17–36.
  • [178] S. Thrun and L. Pratt, Learning to learn.   Springer Science & Business Media, 2012.
  • [179] S. Sun, B. Zhang, L. Xie, and Y. Zhang, “An unsupervised deep domain adaptation approach for robust speech recognition,” Neurocomputing, vol. 257, pp. 79–87, 2017.
  • [180] P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1450–1463, 2016.
  • [181] W.-N. Hsu and J. Glass, “Extracting domain invariant features by unsupervised learning for robust automatic speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5614–5618.
  • [182] H. Tang, W.-N. Hsu, F. Grondin, and J. Glass, “A study of enhancement, augmentation, and autoencoder methods for domain adaptation in distant speech recognition,” arXiv preprint arXiv:1806.04841, 2018.
  • [183] W.-N. Hsu, H. Tang, and J. Glass, “Unsupervised adaptation with interpretable disentangled representations for distant conversational speech recognition,” arXiv preprint arXiv:1806.04872, 2018.
  • [184] Y. Fan, Y. Qian, F. K. Soong, and L. He, “Unsupervised speaker adaptation for dnn-based tts synthesis,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2016, pp. 5135–5139.
  • [185] C.-X. Qin, D. Qu, and L.-H. Zhang, “Towards end-to-end speech recognition with transfer learning,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2018, no. 1, p. 18, 2018.
  • [186] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2013, pp. 7304–7308.
  • [187] K. M. Knill, M. J. Gales, A. Ragni, and S. P. Rath, “Language independent and unsupervised acoustic models for speech recognition and keyword spotting,” 2014.
  • [188] P. Denisov, N. T. Vu, and M. F. Font, “Unsupervised domain adaptation by adversarial learning for robust speech recognition,” in Speech Communication; 13th ITG-Symposium.   VDE, 2018, pp. 1–5.
  • [189] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie, “Domain adversarial training for accented speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 4854–4858.
  • [190] Y. Shinohara, “Adversarial multi-task learning of deep neural networks for robust speech recognition.” in INTERSPEECH.   San Francisco, CA, USA, 2016, pp. 2369–2372.
  • [191] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher, “A multi-discriminator cyclegan for unsupervised non-parallel speech domain adaptation,” arXiv preprint arXiv:1804.00522, 2018.
  • [192] Z. Meng, J. Li, Y. Gong, and B.-H. Juang, “Adversarial teacher-student learning for unsupervised domain adaptation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5949–5953.
  • [193] Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gong, and B.-H. Juang, “Speaker-invariant training via adversarial learning,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5969–5973.
  • [194] Z. Meng, J. Li, and Y. Gong, “Adversarial speaker adaptation,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 5721–5725.
  • [195] A. Tripathi, A. Mohan, S. Anand, and M. Singh, “Adversarial learning of raw speech features for domain invariant speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5959–5963.
  • [196] S. Shon, S. Mun, W. Kim, and H. Ko, “Autoencoder based domain adaptation for speaker recognition under insufficient channel information,” arXiv preprint arXiv:1708.01227, 2017.
  • [197] Q. Wang, W. Rao, S. Sun, L. Xie, E. S. Chng, and H. Li, “Unsupervised domain adaptation via domain adversarial training for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 4889–4893.
  • [198] G. Bhattacharya, J. Monteiro, J. Alam, and P. Kenny, “Generative adversarial speaker embedding networks for domain robust end-to-end speaker verification,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 6226–6230.
  • [199] J. Deng, Z. Zhang, F. Eyben, and B. Schuller, “Autoencoder-based unsupervised domain adaptation for speech emotion recognition,” IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1068–1072, 2014.
  • [200] J. Deng, X. Xu, Z. Zhang, S. Frühholz, and B. Schuller, “Universum autoencoder-based domain adaptation for speech emotion recognition,” IEEE Signal Processing Letters, vol. 24, no. 4, pp. 500–504, 2017.
  • [201] H. Zhou and K. Chen, “Transferable positive/negative speech emotion recognition via class-wise adversarial domain adaptation,” arXiv preprint arXiv:1810.12782, 2018.
  • [202] J. Gideon, M. G. McInnis, and E. Mower Provost, “Barking up the right tree: Improving cross-corpus speech emotion recognition with adversarial discriminative domain generalization (addog),” arXiv preprint arXiv:1903.12094, 2019.
  • [203] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th international conference on Machine learning.   ACM, 2008, pp. 160–167.
  • [204] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2013, pp. 8599–8603.
  • [205] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • [206] R. Caruana, “Multitask learning,” in Learning to Learn.   Springer, 1998.
  • [207] J. Stadermann, W. Koska, and G. Rigoll, “Multi-task learning strategies for a recurrent neural net in a hybrid tied-posteriors acoustic model,” in Ninth European Conference on Speech Communication and Technology, 2005.
  • [208] Z. Huang, J. Li, S. M. Siniscalchi, I.-F. Chen, J. Wu, and C.-H. Lee, “Rapid adaptation for deep neural networks through multi-task learning,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [209] R. Price, K.-i. Iso, and K. Shinoda, “Speaker adaptation of deep neural networks using a hierarchy of output layers,” in 2014 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2014, pp. 153–158.
  • [210] Y. Lu, F. Lu, S. Sehgal, S. Gupta, J. Du, C. H. Tham, P. Green, and V. Wan, “Multitask learning in connectionist speech recognition,” in Proceedings of the Australian International Conference on Speech Science and Technology, 2004.
  • [211] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [212] J. Han, Z. Zhang, Z. Ren, and B. W. Schuller, “Emobed: Strengthening monomodal emotion recognition via training with crossmodal emotion embeddings,” IEEE Transactions on Affective Computing, 2019.
  • [213] J. Kim, G. Englebienne, K. P. Truong, and V. Evers, “Towards speech emotion recognition ‘in the wild’ using aggregated corpora and deep multi-task learning,” arXiv preprint arXiv:1708.03920, 2017.
  • [214] S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning,” INTERSPEECH, Stockholm, Sweden, 2017.
  • [215] R. Xia and Y. Liu, “A multi-task learning framework for emotion recognition using 2D continuous space,” IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 3–14, 2017.
  • [216] R. Lotfian and C. Busso, “Predicting categorical emotions by jointly learning primary and secondary emotions through multitask learning,” in Proc. Interspeech 2018, 2018, pp. 951–955.
  • [217] F. Tao and G. Liu, “Advanced LSTM: A study about better time dependency modeling in emotion recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 2906–2910.
  • [218] B. Zhang, E. M. Provost, and G. Essl, “Cross-corpus acoustic emotion recognition with multi-task learning: Seeking common ground while preserving differences,” IEEE Transactions on Affective Computing, no. 1, pp. 1–1, 2017.
  • [219] G. Pironkov, S. Dupont, and T. Dutoit, “Multi-task learning for speech recognition: an overview.” in ESANN, 2016.
  • [220] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: transfer learning from unlabeled data,” in Proceedings of the 24th international conference on Machine learning.   ACM, 2007, pp. 759–766.
  • [221] B. Ons, N. Tessema, J. Van De Loo, J. Gemmeke, G. De Pauw, W. Daelemans, and H. Van Hamme, “A self learning vocal interface for speech-impaired users,” in Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies, 2013, pp. 73–81.
  • [222] S. Feng and M. F. Duarte, “Autoencoder based sample selection for self-taught learning,” arXiv preprint arXiv:1808.01574, 2018.
  • [223] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
  • [224] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.
  • [225] C. Gelada, S. Kumar, J. Buckman, O. Nachum, and M. G. Bellemare, “Deepmdp: Learning continuous latent space models for representation learning,” arXiv preprint arXiv:1906.02736, 2019.
  • [226] T. Zhang, M. Huang, and L. Zhao, “Learning structured representation for text classification via reinforcement learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [227] H. Cuayáhuitl, S. Renals, O. Lemon, and H. Shimodaira, “Reinforcement learning of dialogue strategies with hierarchical abstract machines,” in 2006 IEEE Spoken Language Technology Workshop.   IEEE, 2006, pp. 182–185.
  • [228] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky, “Deep reinforcement learning for dialogue generation,” arXiv preprint arXiv:1606.01541, 2016.
  • [229] K.-F. Lee and S. Mahajan, “Corrective and reinforcement learning for speaker-independent continuous speech recognition,” Computer Speech & Language, vol. 4, no. 3, pp. 231–245, 1990.
  • [230] E. Lakomkin, M. A. Zamani, C. Weber, S. Magg, and S. Wermter, “Emorl: continuous acoustic emotion classification using deep reinforcement learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 1–6.
  • [231] Y. Zhang, M. Lease, and B. C. Wallace, “Active discriminative text representation learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [232] D. Hakkani-Tur, G. Tur, M. Rahim, and G. Riccardi, “Unsupervised and active learning in automatic speech recognition for call classification,” in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1.   IEEE, 2004, pp. I–429.
  • [233] G. Riccardi and D. Hakkani-Tur, “Active learning: Theory and applications to automatic speech recognition,” IEEE transactions on speech and audio processing, vol. 13, no. 4, pp. 504–511, 2005.
  • [234] J. Huang, R. Child, V. Rao, H. Liu, S. Satheesh, and A. Coates, “Active learning for speech recognition: the power of gradients,” arXiv preprint arXiv:1612.03226, 2016.
  • [235] Z. Zhang, E. Coutinho, J. Deng, and B. Schuller, “Cooperative learning and its application to emotion recognition from speech,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 1, pp. 115–126, 2015.
  • [236] M. Dong and Z. Sun, “On human machine cooperative learning control,” in Proceedings of the 2003 IEEE International Symposium on Intelligent Control.   IEEE, 2003, pp. 81–86.
  • [237] B. W. Schuller, “Speech analysis in the big data era,” in International Conference on Text, Speech, and Dialogue.   Springer, 2015, pp. 3–11.
  • [238] J. Wagner, T. Baur, Y. Zhang, M. F. Valstar, B. Schuller, and E. André, “Applying cooperative machine learning to speed up the annotation of social signals in large multi-modal corpora,” arXiv preprint arXiv:1802.02565, 2018.
  • [239] Z. Zhang, J. Han, J. Deng, X. Xu, F. Ringeval, and B. Schuller, “Leveraging unlabeled data for emotion recognition with enhanced collaborative semi-supervised learning,” IEEE Access, vol. 6, pp. 22196–22209, 2018.
  • [240] Y.-H. Tu, J. Du, L.-R. Dai, and C.-H. Lee, “Speech separation based on signal-noise-dependent deep neural networks for robust speech recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 61–65.
  • [241] A. Narayanan and D. Wang, “Investigation of speech separation as a front-end for noise robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 826–835, 2014.
  • [242] Y. Liu and K. Kirchhoff, “Graph-based semi-supervised acoustic modeling in dnn-based speech recognition,” in 2014 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2014, pp. 177–182.
  • [243] A.-r. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training of deep belief networks for speech recognition,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
  • [244] A.-r. Mohamed, T. N. Sainath, G. E. Dahl, B. Ramabhadran, G. E. Hinton, M. A. Picheny et al., “Deep belief networks using discriminative features for phone recognition.” in ICASSP, 2011, pp. 5060–5063.
  • [245] M. Gholamipoor and B. Nasersharif, “Feature mapping using deep belief networks for robust speech recognition,” The Modares Journal of Electrical Engineering, vol. 14, no. 3, pp. 24–30, 2014.
  • [246] J. Huang and B. Kingsbury, “Audio-visual deep learning for noise robust speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2013, pp. 7596–7599.
  • [247] Y.-J. Hu and Z.-H. Ling, “Dbn-based spectral feature representation for statistical parametric speech synthesis,” IEEE Signal Processing Letters, vol. 23, no. 3, pp. 321–325, 2016.
  • [248] D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-dependent dbn-hmms for real-world speech recognition,” in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
  • [249] D. Yu and M. L. Seltzer, “Improved bottleneck features using pretrained deep neural networks,” in Twelfth annual conference of the international speech communication association, 2011.
  • [250] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2015, pp. 4460–4464.
  • [251] D. Yu, F. Seide, G. Li, and L. Deng, “Exploiting sparseness in deep neural networks for large vocabulary speech recognition,” in 2012 IEEE International conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2012, pp. 4409–4412.
  • [252] P. Dighe, G. Luyet, A. Asaei, and H. Bourlard, “Exploiting low-dimensional structures to enhance DNN based acoustic modeling in speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2016, pp. 5690–5694.
  • [253] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and language processing, vol. 22, no. 10, pp. 1533–1545, 2014.
  • [254] V. Mitra and H. Franco, “Time-frequency convolutional networks for robust speech recognition,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).   IEEE, 2015, pp. 317–323.
  • [255] D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran, S. Thomas, and Y. Bengio, “Invariant representations for noisy speech recognition,” arXiv preprint arXiv:1612.01928, 2016.
  • [256] W. M. Campbell, “Using deep belief networks for vector-based speaker recognition,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [257] O. Ghahabi and J. Hernando, “I-vector modeling with deep belief networks for multi-session speaker recognition,” in Odyssey: The Speaker and Language Recognition Workshop, 2014.
  • [258] V. Vasilakakis, S. Cumani, P. Laface, and P. Torino, “Speaker recognition by means of deep belief networks,” Proc. Biometric Technologies in Forensic Science, pp. 52–57, 2013.
  • [259] T. Yamada, L. Wang, and A. Kai, “Improvement of distant-talking speaker identification using bottleneck features of dnn.” in Interspeech, 2013, pp. 3661–3664.
  • [260] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2016, pp. 5115–5119.
  • [261] H.-s. Heo, J.-w. Jung, I.-h. Yang, S.-h. Yoon, and H.-j. Yu, “Joint training of expanded end-to-end DNN for text-dependent speaker verification.” in INTERSPEECH, 2017, pp. 1532–1536.
  • [262] H. Huang and K. C. Sim, “An investigation of augmenting speaker representations to improve speaker normalisation for dnn-based speech recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 4610–4613.
  • [263] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5329–5333.
  • [264] Y. Z. Işik, H. Erdogan, and R. Sarikaya, “S-vector: A discriminative representation derived from i-vector for speaker verification,” in 2015 23rd European Signal Processing Conference (EUSIPCO).   IEEE, 2015, pp. 2097–2101.
  • [265] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-end attention based text-dependent speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2016, pp. 171–178.
  • [266] C. Zhang and K. Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances.” in Interspeech, 2017, pp. 1487–1491.
  • [267] M. McLaren, Y. Lei, N. Scheffer, and L. Ferrer, “Application of convolutional neural networks to speaker recognition in noisy conditions,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [268] Y. Lukic, C. Vogt, O. Dürr, and T. Stadelmann, “Speaker identification and clustering using convolutional neural networks,” in 2016 IEEE 26th international workshop on machine learning for signal processing (MLSP).   IEEE, 2016, pp. 1–6.
  • [269] S. Ranjan and J. H. Hansen, “Improved gender independent speaker recognition using convolutional neural network based bottleneck features.” in INTERSPEECH, 2017, pp. 1009–1013.
  • [270] S. Wang, Y. Qian, and K. Yu, “What does the speaker embedding encode?” in Interspeech, 2017, pp. 1497–1501.
  • [271] S. Shon, H. Tang, and J. Glass, “Frame-level speaker embeddings for text-independent speaker recognition and analysis of end-to-end model,” in 2018 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2018, pp. 1007–1013.
  • [272] E. Marchi, S. Shum, K. Hwang, S. Kajarekar, S. Sigtia, H. Richards, R. Haynes, Y. Kim, and J. Bridle, “Generalised discriminative transform via curriculum learning for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5324–5328.
  • [273] M. McLaren, Y. Lei, and L. Ferrer, “Advances in deep neural network approaches to speaker recognition,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2015, pp. 4814–4818.
  • [274] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
  • [275] D. Michelsanti and Z.-H. Tan, “Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification,” arXiv preprint arXiv:1709.01703, 2017.
  • [276] J. Lorenzo-Trueba, G. E. Henter, S. Takaki, J. Yamagishi, Y. Morino, and Y. Ochiai, “Investigating different representations for modeling and controlling multiple emotions in dnn-based speech synthesis,” Speech Communication, vol. 99, pp. 135–143, 2018.
  • [277] Z.-Q. Wang and I. Tashev, “Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2017, pp. 5150–5154.
  • [278] M. N. Stolar, M. Lech, and I. S. Burnett, “Optimized multi-channel deep neural network with 2D graphical representation of acoustic speech features for emotion recognition,” in 2014 8th International Conference on Signal Processing and Communication Systems (ICSPCS).   IEEE, 2014, pp. 1–6.
  • [279] J. Lee and I. Tashev, “High-level feature representation using recurrent neural network for speech emotion recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [280] C.-W. Huang and S. S. Narayanan, “Attention assisted discovery of sub-utterance structure in speech emotion recognition.” in INTERSPEECH, 2016, pp. 1387–1391.
  • [281] J. Kim, K. P. Truong, G. Englebienne, and V. Evers, “Learning spectro-temporal features with 3D CNNs for speech emotion recognition,” in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII).   IEEE, 2017, pp. 383–388.
  • [282] W. Zheng, J. Yu, and Y. Zou, “An experimental study of speech emotion recognition based on deep convolutional neural networks,” in 2015 international conference on affective computing and intelligent interaction (ACII).   IEEE, 2015, pp. 827–831.
  • [283] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, “Learning salient features for speech emotion recognition using convolutional neural networks,” IEEE transactions on multimedia, vol. 16, no. 8, pp. 2203–2213, 2014.
  • [284] P. Li, Y. Song, I. V. McLoughlin, W. Guo, and L. Dai, “An attention pooling based representation learning method for speech emotion recognition.” in Interspeech, 2018, pp. 3087–3091.
  • [285] Z. Zhao, Y. Zheng, Z. Zhang, H. Wang, Y. Zhao, and C. Li, “Exploring spatio-temporal representations by integrating attention-based bidirectional-lstm-rnns and fcns for speech emotion recognition,” in Interspeech, 2018.
  • [286] D. Luo, Y. Zou, and D. Huang, “Investigation on joint representation learning for robust feature extraction in speech emotion recognition.” in Interspeech, 2018, pp. 152–156.
  • [287] W. Lim, D. Jang, and T. Lee, “Speech emotion recognition using convolutional and recurrent neural networks,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).   IEEE, 2016, pp. 1–4.
  • [288] D. Palaz, R. Collobert, and M. M. Doss, “Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks,” arXiv preprint arXiv:1304.1018, 2013.
  • [289] S. H. Kabil, H. Muckenhirn, and M. Magimai-Doss, “On learning to identify genders from raw speech signal using cnns.” in Interspeech, 2018, pp. 287–291.
  • [290] P. Golik, Z. Tüske, R. Schlüter, and H. Ney, “Convolutional neural networks for acoustic modeling of raw time signal in lvcsr,” in Sixteenth annual conference of the international speech communication association, 2015.
  • [291] W. Dai, C. Dai, S. Qu, J. Li, and S. Das, “Very deep convolutional neural networks for raw waveforms,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 421–425.
  • [292] X. Liu, “Deep convolutional and LSTM neural networks for acoustic modelling in automatic speech recognition,” 2018.
  • [293] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, and E. Dupoux, “Learning filterbanks from raw speech for phone recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5509–5513.
  • [294] R. Zazo Candil, T. N. Sainath, G. Simko, and C. Parada, “Feature learning with raw-waveform CLDNNs for voice activity detection,” in Interspeech, 2016.
  • [295] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with SincNet,” in 2018 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2018, pp. 1021–1028.
  • [296] J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu, “Avoiding speaker overfitting in end-to-end dnns using raw waveform for text-independent speaker verification,” in Interspeech, 2018.
  • [297] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with SincNet,” in 2018 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2018, pp. 1021–1028.
  • [298] J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu, “A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5349–5353.
  • [299] M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, “Improving emotion identification using phone posteriors in raw speech waveform based dnn,” Proc. Interspeech 2019, pp. 3925–3929, 2019.
  • [300] H. Lu, S. King, and O. Watts, “Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis,” in Eighth ISCA Workshop on Speech Synthesis, 2013.
  • [301] D. Hau and K. Chen, “Exploring hierarchical speech representations with a deep convolutional neural network,” UKCI 2011 Accepted Papers, p. 37, 2011.
  • [302] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, “An unsupervised autoregressive model for speech representation learning,” arXiv preprint arXiv:1904.03240, 2019.
  • [303] Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, “Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,” arXiv preprint arXiv:1603.00982, 2016.
  • [304] T. Ishii, H. Komiyama, T. Shinozaki, Y. Horiuchi, and S. Kuroiwa, “Reverberant speech recognition based on denoising autoencoder.” in Interspeech, 2013, pp. 3512–3516.
  • [305] B. Xia and C. Bao, “Speech enhancement with weighted denoising auto-encoder.” in INTERSPEECH, 2013, pp. 3444–3448.
  • [306] Z. Zhang, L. Wang, A. Kai, T. Yamada, W. Li, and M. Iwahashi, “Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, p. 12, 2015.
  • [307] A. van den Oord, O. Vinyals et al., “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, 2017, pp. 6306–6315.
  • [308] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen et al., “Hierarchical generative modeling for controllable speech synthesis,” arXiv preprint arXiv:1810.07217, 2018.
  • [309] M. Pal, M. Kumar, R. Peri, T. J. Park, S. H. Kim, C. Lord, S. Bishop, and S. Narayanan, “Speaker diarization using latent space clustering in generative adversarial network,” arXiv preprint arXiv:1910.11398, 2019.
  • [310] N. E. Cibau, E. M. Albornoz, and H. L. Rufiner, “Speech emotion recognition using a deep autoencoder,” Anales de la XV Reunion de Procesamiento de la Informacion y Control, vol. 16, pp. 934–939, 2013.
  • [311] S. E. Eskimez, Z. Duan, and W. Heinzelman, “Unsupervised learning approach to feature analysis for automatic speech emotion recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5099–5103.
  • [312] S. Sahu, R. Gupta, and C. Espy-Wilson, “Modeling feature representations for affective speech using generative adversarial networks,” arXiv preprint arXiv:1911.00030, 2019.
  • [313] Y.-A. Chung, Y. Wang, W.-N. Hsu, Y. Zhang, and R. Skerry-Ryan, “Semi-supervised training for improving data efficiency in end-to-end speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 6940–6944.
  • [314] Y. Zhao, S. Takaki, H.-T. Luong, J. Yamagishi, D. Saito, and N. Minematsu, “Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthesis systems using a wavenet vocoder,” IEEE Access, vol. 6, pp. 60478–60488, 2018.
  • [315] Z. Huang, S. M. Siniscalchi, and C.-H. Lee, “A unified approach to transfer learning of deep neural networks with applications to speaker adaptation in automatic speech recognition,” Neurocomputing, vol. 218, pp. 448–459, 2016.
  • [316] P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR,” in 2012 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2012, pp. 246–251.
  • [317] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, “Cross corpus speech emotion classification-an effective transfer learning technique,” arXiv preprint arXiv:1801.06353, 2018.
  • [318] M. Neumann et al., “Cross-lingual and multilingual speech emotion recognition on english and french,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5769–5773.
  • [319] M. Abdelwahab and C. Busso, “Domain adversarial for acoustic emotion recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 12, pp. 2423–2435, 2018.
  • [320] M. Tu, Y. Tang, J. Huang, X. He, and B. Zhou, “Towards adversarial learning of speaker-invariant representation for speech emotion recognition,” arXiv preprint arXiv:1903.09606, 2019.
  • [321] J. Deng, R. Xia, Z. Zhang, Y. Liu, and B. Schuller, “Introducing shared-hidden-layer autoencoders for transfer learning and their application in acoustic emotion recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2014, pp. 4818–4822.
  • [322] J. Deng, S. Frühholz, Z. Zhang, and B. Schuller, “Recognizing emotions from whispered speech based on acoustic feature transfer learning,” IEEE Access, vol. 5, pp. 5235–5246, 2017.
  • [323] Y. Xu, J. Du, Z. Huang, L.-R. Dai, and C.-H. Lee, “Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement,” arXiv preprint arXiv:1703.07172, 2017.
  • [324] M. L. Seltzer and J. Droppo, “Multi-task learning in deep neural networks for improved phoneme recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2013, pp. 6965–6969.
  • [325] T. Tan, Y. Qian, D. Yu, S. Kundu, L. Lu, K. C. Sim, X. Xiao, and Y. Zhang, “Speaker-aware training of LSTM-RNNs for acoustic modelling,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2016, pp. 5280–5284.
  • [326] X. Li and X. Wu, “Modeling speaker variability using long short-term memory networks for speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [327] Z. Tang, L. Li, D. Wang, and R. Vipperla, “Collaborative joint training with multitask recurrent model for speech and speaker recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 3, pp. 493–504, 2017.
  • [328] Y. Qian, J. Tao, D. Suendermann-Oeft, K. Evanini, A. V. Ivanov, and V. Ramanarayanan, “Noise and metadata sensitive bottleneck features for improving speaker recognition with non-native speech input.” in INTERSPEECH, 2016, pp. 3648–3652.
  • [329] N. Chen, Y. Qian, and K. Yu, “Multi-task learning for text-dependent speaker verification,” in Sixteenth annual conference of the international speech communication association, 2015.
  • [330] S. Yadav and A. Rai, “Learning discriminative features for speaker identification and verification.” in Interspeech, 2018, pp. 2237–2241.
  • [331] Z. Tang, L. Li, and D. Wang, “Multi-task recurrent model for speech and speaker recognition,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).   IEEE, 2016, pp. 1–4.
  • [332] R. Xia and Y. Liu, “A multi-task learning framework for emotion recognition using 2D continuous space,” IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 3–14, 2015.
  • [333] Y. Zhang, Y. Liu, F. Weninger, and B. Schuller, “Multi-task deep neural network with shared hidden layers: Breaking down the wall between emotion representations,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2017, pp. 4990–4994.
  • [334] F. Ma, W. Gu, W. Zhang, S. Ni, S.-L. Huang, and L. Zhang, “Speech emotion recognition via attention-based DNN from multi-task learning,” in Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems.   ACM, 2018, pp. 363–364.
  • [335] Z. Zhang, B. Wu, and B. Schuller, “Attention-augmented end-to-end multi-task learning for emotion prediction from speech,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 6705–6709.
  • [336] F. Eyben, M. Wöllmer, and B. Schuller, “A multitask approach to continuous five-dimensional affect sensing in natural speech,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 2, no. 1, p. 6, 2012.
  • [337] D. Le, Z. Aldeneh, and E. M. Provost, “Discretized continuous speech emotion recognition with multi-task deep recurrent neural network,” in Interspeech, 2017.
  • [338] H. Li, M. Tu, J. Huang, S. Narayanan, and P. Georgiou, “Speaker-invariant affective representation learning via adversarial training,” arXiv preprint arXiv:1911.01533, 2019.
  • [339] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Advances in neural information processing systems, 2015, pp. 2377–2385.
  • [340] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • [341] A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120, 2013.
  • [342] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [343] K. Jia, D. Tao, S. Gao, and X. Xu, “Improving training of deep neural networks via singular value bounding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4344–4352.
  • [344] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 901–909.
  • [345] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [346] H. Li, B. Baucom, and P. Georgiou, “Unsupervised latent behavior manifold learning from acoustic features: Audio2behavior,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 5620–5624.
  • [347] Y. Gong and C. Poellabauer, “Towards learning fine-grained disentangled representations from speech,” arXiv preprint arXiv:1808.02939, 2018.
  • [348] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
  • [349] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” arXiv preprint, 2017.
  • [350] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann, “Stabilizing training of generative adversarial networks through regularization,” in Advances in neural information processing systems, 2017, pp. 2018–2028.
  • [351] S. Asakawa, N. Minematsu, and K. Hirose, “Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics,” in Eighth Annual Conference of the International Speech Communication Association, 2007.
  • [352] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
  • [353] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in 2016 IEEE European Symposium on Security and Privacy (EuroS&P).   IEEE, 2016, pp. 372–387.
  • [354] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: a simple and accurate method to fool deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2574–2582.
  • [355] N. Carlini and D. Wagner, “Audio adversarial examples: Targeted attacks on speech-to-text,” in 2018 IEEE Security and Privacy Workshops (SPW).   IEEE, 2018, pp. 1–7.
  • [356] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
  • [357] M. Alzantot, B. Balaji, and M. Srivastava, “Did you hear that? adversarial examples against automatic speech recognition,” arXiv preprint arXiv:1801.00554, 2018.
  • [358] L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D. Kolossa, “Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding,” arXiv preprint arXiv:1808.05665, 2018.
  • [359] N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,” IEEE Access, vol. 6, pp. 14 410–14 430, 2018.
  • [360] S. Yin, C. Liu, Z. Zhang, Y. Lin, D. Wang, J. Tejedor, T. F. Zheng, and Y. Li, “Noisy training for deep neural networks in speech recognition,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, p. 2, 2015.
  • [361] F. Bu, Z. Chen, Q. Zhang, and X. Wang, “Incomplete big data clustering algorithm using feature selection and partial distance,” in 2014 5th International Conference on Digital Home.   IEEE, 2014, pp. 263–266.
  • [362] R. Wang and D. Tao, “Non-local auto-encoder with collaborative stabilization for image restoration,” IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2117–2129, 2016.
  • [363] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” in Proceedings of the 14th python in science conference, vol. 8, 2015.
  • [364] T. Giannakopoulos, “pyaudioanalysis: An open-source python library for audio signal analysis,” PloS one, vol. 10, no. 12, p. e0144610, 2015.
  • [365] F. Eyben, M. Wöllmer, and B. Schuller, “OpenSMILE – the Munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM international conference on Multimedia.   ACM, 2010, pp. 1459–1462.
  • [366] P. Lamere, P. Kwok, E. Gouvea, B. Raj, R. Singh, W. Walker, M. Warmuth, and P. Wolf, “The cmu sphinx-4 speech recognition system,” in IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2003), Hong Kong, vol. 1, 2003, pp. 2–5.
  • [367] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.   IEEE Signal Processing Society, 2011.
  • [368] A. Lee, T. Kawahara, and K. Shikano, “Julius—an open source real-time large vocabulary recognition engine,” 2001.
  • [369] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
  • [370] P. C. Woodland, J. J. Odell, V. Valtchev, and S. J. Young, “Large vocabulary continuous speech recognition using htk.” in ICASSP (2), 1994, pp. 125–128.
  • [371] A. Larcher, J.-F. Bonastre, B. Fauve, K. Lee, C. Lévy, H. Li, J. Mason, and J.-Y. Parfait, “Alize 3.0-open source toolkit for state-of-the-art speaker recognition,” 2013.
  • [372] F. Eyben, M. Wöllmer, and B. Schuller, “Openear—introducing the munich open-source emotion and affect recognition toolkit,” in 2009 3rd international conference on affective computing and intelligent interaction and workshops.   IEEE, 2009, pp. 1–6.
  • [373] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).   IEEE, 2017, pp. 1–12.
  • [374] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, pp. 505–510, 2019.
  • [375] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, “Gansynth: Adversarial neural audio synthesis,” arXiv preprint arXiv:1902.08710, 2019.
  • [376] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” arXiv preprint arXiv:1910.06711, 2019.
  • [377] R. Chesney and D. K. Citron, “Deep fakes: a looming challenge for privacy, democracy, and national security,” 2018.
  • [378] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
  • [379] P. S. Nidadavolu, S. Kataria, J. Villalba, and N. Dehak, “Low-resource domain adaptation for speaker recognition using cycle-gans,” arXiv preprint arXiv:1910.11909, 2019.
  • [380] F. Bao, M. Neumann, and N. T. Vu, “Cyclegan-based emotion style transfer as data augmentation for speech emotion recognition,” Manuscript submitted for publication, pp. 35–37, 2019.
  • [381] V. Thomas, E. Bengio, W. Fedus, J. Pondard, P. Beaudoin, H. Larochelle, J. Pineau, D. Precup, and Y. Bengio, “Disentangling the independently controllable factors of variation by interacting with the world,” arXiv preprint arXiv:1802.09484, 2018.
  • [382] M. A. Pathak, B. Raj, S. D. Rane, and P. Smaragdis, “Privacy-preserving speech processing: cryptographic and string-matching frameworks show promise,” IEEE signal processing magazine, vol. 30, no. 2, pp. 62–74, 2013.
  • [383] B. M. L. Srivastava, A. Bellet, M. Tommasi, and E. Vincent, “Privacy-preserving adversarial representation learning in asr: Reality or illusion?” in Proc. INTERSPEECH, 2019, pp. 3700–3704.
  • [384] M. Jaiswal and E. M. Provost, “Privacy enhanced multimodal neural representations for emotion recognition,” arXiv preprint arXiv:1910.13212, 2019.
  • [385] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC conference on computer and communications security.   ACM, 2015, pp. 1310–1321.