M2H-GAN: A GAN-based Mapping from Machine to Human Transcripts for Speech Understanding

04/13/2019 ∙ by Titouan Parcollet, et al. ∙ Université d'Avignon et des Pays de Vaucluse

Deep learning is at the core of recent spoken language understanding (SLU) tasks. More precisely, deep neural networks (DNNs) have drastically increased the performance of SLU systems, and numerous architectures have been proposed. In the real-life context of theme identification of telephone conversations, it is common to hold both a human, manual transcription (TRS) and an automatic transcription (ASR) of the conversations. Nonetheless, due to production constraints, only the ASR transcripts are considered to build automatic classifiers. TRS transcripts are only used to measure the performance of ASR systems. Moreover, the classification accuracy recently obtained by DNN-based systems is close to the performance reached by humans, and it becomes difficult to improve it further by considering the ASR transcripts alone. This paper proposes to distill the TRS knowledge available during the training phase into the ASR representation, using a new generative adversarial network called M2H-GAN that generates a TRS-like version of an ASR document, to improve theme identification performance.




1 Introduction

Spoken language understanding (SLU) has been massively impacted by machine learning (ML) algorithms, and more precisely by deep neural networks (DNNs). Interesting solutions have therefore been proposed for SLU in human-computer and human-human dialogues [1, 2, 3, 4]. An important component of this domain is the task of topic identification in text documents [5]. As an example, this paper deals with customer care services (CCS) in which an agent interacts with a customer to address her/his concerns and to provide a solution. The automatic system is expected to correctly identify the major theme of the conversation from the transcriptions obtained by an automatic speech recognition (ASR) system, or from human manual transcripts (TRS). Unfortunately, TRS versions are only available at training time, because the production environment is automated and relies on ASR transcripts. To address this problem, various neural architectures have been developed to directly classify the ASR transcriptions of telephone conversations, based on multi-layer perceptrons and pre-trained deep neural networks [6, 7], or convolutional neural networks [8]. Nonetheless, while such models are powerful, they are also limited by the quality of the ASR transcriptions. The authors of [9, 10] proposed to use the knowledge available at training time through the TRS transcriptions to enhance the input representation of the ASR versions. This enhancement is made possible by stacked and deep stacked auto-encoders that learn a static mapping projecting the ASR latent space to the TRS one. Based on the promising results observed with this approach, we propose to further investigate the distillation of the TRS knowledge into the ASR representation with the recent generative adversarial networks (GANs).

GANs are an active field of research and offer an interesting approach that relies on a game-theoretic method to train a generative model [11]. Numerous architectures have been proposed to address various tasks [12, 13]. From a simplified perspective, GANs are commonly used to learn a mapping from a random noise space to a target one, making it possible to generate new unseen samples. In natural language processing (NLP) tasks, the noise space is commonly replaced with a well-defined input representation, such as text written in a specific language for neural machine translation [14]. Then, GANs are used to project this latent representation to a different target language. In the task of theme identification of telephone conversations investigated in this paper, we consider the latent ASR transcription as the noise space, and the TRS versions as the target one. After training, the model is expected to enhance the ASR latent representation with TRS knowledge, to further improve the results when classifying the documents.

This paper proposes a task-adapted model called Machine-to-Human GAN (M2H-GAN), obtained by merging the GAN with a semi-supervised GAN (SGAN), to better represent and classify telephone conversations. The contributions of the paper are:

  • Introduce a new GAN architecture called M2H-GAN to efficiently map the automatically transcribed representation of conversations, to a latent representation of their manually transcribed version (Section 3).

  • Compare the classification accuracy obtained with this new representation to other methods on a theme identification of telephone conversations task (Section 4).

The experiments conducted on the DECODA [15] dataset show that the M2H-GAN reduces the performance gap in terms of classification accuracy between automatically and manually transcribed documents by learning a robust mapping between the two latent sub-spaces. The M2H-GAN also offers a more stable classification process, with a lower standard deviation across the observed results.

Figure 1: Illustration of the M2H-GAN architecture at training (top) and testing (bottom) time. Red and blue lines show the ASR and TRS representation signal. Note that the output of the generator goes from red to blue during the training phase.

2 Related work

Generative neural models are a specific and active area of the machine learning field. Recently, generative adversarial networks (GANs) [11] have received considerable interest due to the remarkable results obtained in computer vision [16, 12, 17, 18]. The ability of GANs to generate samples that are closely related to target ones has also been extended to natural language processing (NLP) tasks, such as text and dialogue generation [19, 20, 21], or neural machine translation [14, 13]. To the best of our knowledge, [13] is the work most closely related to the problem addressed in this paper. Indeed, [13] proposed a model aiming to generate sentences that are hard to discriminate from human-translated sentences. Consequently, the GAN model is expected to learn a mapping from one language to another based on human manual translations. The task considered in this paper replaces the initial language by an automatic transcription of a conversation, and the target language by its manual transcription. Furthermore, we propose to perform classification on top of the generation, as driven by the semi-supervised GAN (SGAN) approach [22]. Nonetheless, our model uses a different architecture that does not take the target classes into consideration when generating the samples, since gold targets (i.e. manual transcriptions) are not available at testing time.

3 Generative neural models

In this paper a basic GAN is merged with the semi-supervised SGAN (Section 3.1) to allow a projection of an automatically transcribed document, to its manual transcription representation with the Machine-to-Human GAN (M2H-GAN, Section 3.2).

3.1 Generative Adversarial Networks

In a generative adversarial network [11], two neural networks are trained in opposition. First, a generator $G$ outputs a fake object named $X_{fake}$ from an input random noise vector $z$:

$$X_{fake} = G(z) \qquad (1)$$

Then, a discriminator $D$ receives alternatively a true sample $X_{true}$ or a fake one from $G$, and outputs a probability distribution $P(S \mid X)$ over the possible sources $S$ of the input, i.e. whether the input is a fake or not. During training, $D$ tries to maximize the log-likelihood of the correctly assigned source:

$$L = \mathbb{E}\left[\log P(S = true \mid X_{true})\right] + \mathbb{E}\left[\log P(S = fake \mid X_{fake})\right] \qquad (2)$$

In the same manner, $G$ is trained to fool $D$ by minimizing the second term of Eq. 2. Indeed, reducing the probability of correct classification of fake inputs increases the generating capability of $G$.
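As a minimal numeric sketch of this objective, the snippet below evaluates the log-likelihood of the correctly assigned source for two hand-picked sets of discriminator outputs. The function name and all probability values are illustrative assumptions, not taken from the paper; expectations are approximated by averages over a toy batch.

```python
import math

def discriminator_objective(p_true_on_true, p_fake_on_fake):
    """Average log-likelihood of the correct source (maximized by the discriminator).

    p_true_on_true: probability of "true" assigned to each real sample.
    p_fake_on_fake: probability of "fake" assigned to each generated sample.
    Returns the full objective and its second (fake) term.
    """
    true_term = sum(math.log(p) for p in p_true_on_true) / len(p_true_on_true)
    fake_term = sum(math.log(p) for p in p_fake_on_fake) / len(p_fake_on_fake)
    return true_term + fake_term, fake_term

# Toy values (assumed): a confident discriminator vs. a fooled one.
confident, fake_term_confident = discriminator_objective([0.9, 0.8], [0.9, 0.7])
fooled, fake_term_fooled = discriminator_objective([0.9, 0.8], [0.4, 0.3])

# The generator minimizes the second term: it is lower when the discriminator is fooled.
print(confident, fooled)
print(fake_term_fooled < fake_term_confident)
```

The first term is identical in both cases, so the drop in the objective comes entirely from the term that the generator acts on.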

Auxiliary and semi-supervised GANs [23, 22] have been proposed to take into consideration the labels in both the generator and the discriminator to drive the generation process toward a specific class. In an SGAN, the discriminator $D$ is trained to determine if the input signal is fake or of a certain label. Consequently, the output dimension of $D$ is of size $N+1$, with $N$ being the number of classes and the additional output representing the fake case. The loss function remains unchanged. SGANs use labels to add a condition on the generation process, making it possible to generate samples of a specific class, such as car or bird for image generation.
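The extended output layer can be illustrated with a short sketch: a softmax over one score per class plus one extra score for the fake case. The number of classes (eight, as in Table 1), the position of the fake output, and the scores themselves are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

NUM_CLASSES = 8            # assumed: the eight DECODA themes of Table 1
FAKE_INDEX = NUM_CLASSES   # assumed: the extra output is placed last

# Hypothetical discriminator scores for one input (one per class, plus fake).
logits = [0.2, 1.5, -0.3, 0.0, 0.4, -1.0, 0.1, 0.3, 2.0]
probs = softmax(logits)

p_fake = probs[FAKE_INDEX]
predicted = max(range(len(probs)), key=probs.__getitem__)
print("P(fake) =", round(p_fake, 3))
print("decision:", "fake" if predicted == FAKE_INDEX else f"class {predicted}")
```

A single softmax makes the fake decision compete directly with the class decisions, which is what lets one loss function cover both tasks.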

3.2 Machine-to-Human representation with generative models

We propose to merge the initial GAN with its semi-supervised version, SGAN, in a model named Machine-to-Human GAN (M2H-GAN). An overview of the M2H-GAN architecture is depicted in Figure 1. In M2H-GAN, $X_{fake}$ is the representation produced by the generator $G$ for an automatically transcribed document (ASR), and $X_{true}$ is the “clean” TRS version of the same sample. The discriminator $D$ is trained to determine if the input has been generated or belongs to a certain class (as in SGAN), and thus contains $N+1$ output neurons. Consequently, $G$ is jointly trained to map the ASR representation to a latent, “clean” TRS representation in order to fool the discriminator. Unlike in SGAN, the generator of M2H-GAN does not have access to the label, because conversation classes are unknown during the testing phase. This modification gives the discriminator more room to discover if an input is fake or not, making it more powerful. As a consequence, the generator must create a more convincing representation of the ASR signal, and receives gradients according to the label without any conditioning on the input.
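One possible reading of these two objectives is sketched below with plain cross-entropy terms. The exact loss formulation, the number of themes, and the discriminator outputs are assumptions made for illustration; the point is only that the theme label enters the discriminator loss and never the generator input.

```python
import math

NUM_THEMES = 8      # assumed from Table 1
FAKE = NUM_THEMES   # index of the extra "fake" output (assumed to be last)

def cross_entropy(probs, target):
    """Negative log-probability assigned to the target index."""
    return -math.log(probs[target])

def discriminator_loss(probs_on_trs, theme, probs_on_generated):
    # Real TRS vectors should be assigned their theme,
    # generated vectors should be assigned the FAKE output.
    return cross_entropy(probs_on_trs, theme) + cross_entropy(probs_on_generated, FAKE)

def generator_loss(probs_on_generated):
    # The generator never sees the theme label: its loss only decreases
    # when the discriminator puts less mass on the FAKE output.
    return -math.log(1.0 - probs_on_generated[FAKE])

# Hypothetical discriminator outputs (each list sums to 1).
on_trs = [0.02, 0.02, 0.02, 0.80, 0.02, 0.02, 0.02, 0.02, 0.06]
on_gen = [0.05, 0.05, 0.05, 0.30, 0.05, 0.05, 0.05, 0.05, 0.35]

d_loss = discriminator_loss(on_trs, theme=3, probs_on_generated=on_gen)
g_loss = generator_loss(on_gen)
print(d_loss, g_loss)
```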

4 Experiments

This section introduces the task of theme classification of telephone conversations with the DECODA dataset (Section 4.1), along with the proposed representation of the documents (Section 4.2). The investigated architectures are detailed in Section 4.3, while the observed results are reported in Section 4.4.

4.1 Spoken conversations dataset

The corpus of spoken conversations is a set of automatically transcribed and annotated human-human telephone conversations of the Paris transportation system CCS (RATP). This corpus comes from the first version of the DECODA project [15] and is employed to evaluate the effectiveness of the proposed M2H-GAN on a conversation theme identification task. The DECODA corpus is composed of 1,242 telephone conversations (see the totals in Table 1) recorded during high-traffic days in the capital, which is equivalent to about hours of signal. The dataset was split into eight categories, or dominant themes, that are detailed in Table 1. An example of a manually transcribed conversation of DECODA is given in Figure 2.

Figure 2: Example of a human transcription of a dialogue from the DECODA corpus for the SLU task of theme identification.

It is important to highlight the difficulty of the classification task due to the close sub-topics that can occur within a conversation. Indeed, a customer can ask for the price of a transportation card after a loss, and the document will be assigned to transportation cards, while the vocabulary is also closely related to lost and found. Furthermore, high word error rates (WER) are reported on the ASR transcripts with the LIA-Speeral ASR system [24], due to very difficult and noisy environments including streets, buses and metros. Indeed, WERs of %, % and % are obtained on the training, validation and test sets respectively. Considering the high WERs and the closely related sub-topics within a document, it is crucial to introduce the clean and manual transcription of the conversation information to the training process, to build better classification systems.

Class label              Training   Development   Testing
problems of itinerary         145            44        67
lost and found                143            33        63
time schedules                 47             7        18
transportation cards          106            24        47
state of the traffic          202            45        90
fares                          19             9        11
infractions                    47             4        18
special offers                 31             9        13
Total                         740           175       327
Table 1: DECODA dataset (number of samples per class label and subset).

4.2 Abstract document representation with LDA

Latent Dirichlet allocation (LDA) is an effective method to represent documents in an unsupervised manner, as probability distributions over hidden topics [25], and has shown its efficiency in many previous related works [26, 6]. For the experiments described in this section, LDA models are trained over the training set of DECODA following the standard hyper-parameter heuristics [25]. It is important to note that two LDA models are trained, with either the ASR or the TRS conversations from the training subset of the DECODA dataset. Consequently, , with the number of topics, and . The number has been previously investigated for this task in [6, 7], and is set to . More precisely, runs of the LDA model are concatenated to obtain a final vector of size , to alleviate any variations. Then, every conversation is projected into the corresponding LDA space, and is embedded in a vector of size .
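The concatenation of several LDA runs can be sketched as follows. The helper name, the number of runs, the number of topics, and the topic distributions are all hypothetical; a real pipeline would obtain each distribution from a trained LDA model rather than from hard-coded lists.

```python
def concatenate_runs(runs):
    """Concatenate the per-run topic distributions of one document into one vector."""
    vector = []
    for distribution in runs:
        # Each run yields a probability distribution over its hidden topics.
        if abs(sum(distribution) - 1.0) > 1e-6:
            raise ValueError("each run must provide a probability distribution")
        vector.extend(distribution)
    return vector

# Hypothetical output of three LDA runs with four topics each, for one conversation.
runs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
    [0.05, 0.80, 0.05, 0.10],
]
document_vector = concatenate_runs(runs)
print(len(document_vector))  # number of runs times number of topics
```

Because each run starts from a different random initialization, concatenating them averages out the variation of any single topic space.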

4.3 Experimental protocol

To evaluate the effectiveness of the M2H-GAN to generate TRS-like representations of ASR transcripts, we compare M2H-GAN to a GAN model on the theme classification of telephone conversations. Deep feed-forward NNs trained on TRS and ASR transcripts are used as baselines. We also compare M2H-GAN to previously investigated generative models [10]. Training and testing steps are detailed in Algorithm 1, and can be summarized as follows: 1) Train GAN or M2H-GAN models; 2) Freeze the generator and train a DNN classifier from the generated features. Finally, Figure 1 represents the global architecture of the model.

procedure Train GANs(ASR, TRS)
    Project ASR, TRS in LDA to obtain $z$, $X_{true}$.
    Generate $X_{fake}$ with the generator $G$ from $z$.
    Train $G$ and the discriminator $D$ based on $X_{fake}$, $X_{true}$ [11].
procedure Train DNNs(ASR)
    Project ASR in LDA to obtain $z$.
    Generate $X_{fake}$ with frozen $G$ from $z$.
    Train a DNN to classify $X_{fake}$.
Algorithm 1: Training procedures.
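The control flow of the two procedures can be sketched as below. Every function here is a hypothetical stand-in (the stubs are not real LDA, GAN, or DNN models); the sketch only shows that the generator is trained first and is then called, without being updated, to produce the classifier's features.

```python
def run_algorithm_1(asr_docs, trs_docs, labels, lda_asr, lda_trs, train_gan, train_dnn):
    # Procedure "Train GANs": project both views, then fit the GAN adversarially.
    z = [lda_asr(doc) for doc in asr_docs]
    x_true = [lda_trs(doc) for doc in trs_docs]
    generator = train_gan(z, x_true, labels)  # returns the trained generator only

    # Procedure "Train DNNs": the generator is frozen (called, never updated).
    x_fake = [generator(vec) for vec in z]
    classifier = train_dnn(x_fake, labels)
    return generator, classifier

# Trivial stand-ins so that the sketch runs end to end (not real models).
def lda_stub(doc):
    return list(doc)

def train_gan_stub(z, x_true, labels):
    def generator(vec):
        return [0.5 * v for v in vec]
    return generator

def train_dnn_stub(features, labels):
    majority = max(set(labels), key=labels.count)
    def classifier(vec):
        return majority
    return classifier

g, clf = run_algorithm_1(
    [[0.2, 0.8], [0.6, 0.4]], [[0.1, 0.9], [0.7, 0.3]], ["fares", "fares"],
    lda_stub, lda_stub, train_gan_stub, train_dnn_stub,
)
print(g([0.2, 0.8]), clf([0.1, 0.4]))
```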

DNNs. Classifiers rely on hidden layers of size with activations, and a final softmax layer corresponding to the eight themes of the DECODA dataset [15]. They are trained during epochs based on the Adam optimizer [27] with vanilla hyper-parameters and no regularization techniques. After training, two values are saved: the maximum accuracy obtained on the test set, and the test accuracy observed at the point of best validation performance.
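The difference between the two saved values can be made concrete with a small sketch. The function name and the per-epoch accuracies are invented for the example; only the selection rule follows the description above.

```python
def real_and_max_test(val_acc, test_acc):
    """Return (test accuracy at the best validation epoch, best test accuracy)."""
    best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__)
    return test_acc[best_epoch], max(test_acc)

# Hypothetical per-epoch accuracies for a single run.
val_acc = [0.84, 0.88, 0.90, 0.89]
test_acc = [0.80, 0.85, 0.84, 0.86]

real_test, max_test = real_and_max_test(val_acc, test_acc)
print(real_test, max_test)
```

The first value is the one a practitioner could actually obtain, since model selection uses the validation set only; the second is an optimistic upper bound that peeks at the test set.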

GAN. The generator is made of hidden layers of size and (corresponding to the size of the LDA vector) with layer-wise normalization [28] and activations, while the discriminator is composed of layers of and neurons with and sigmoid activation functions. The discriminating labels are smoothed by being sampled from a uniform distribution bounded by for the valid ones, and by for the fake ones, as proposed in [29].
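Label smoothing by uniform sampling can be sketched as follows. The bounds used here are placeholders chosen for the example and should not be read as the paper's values; the helper name is likewise an assumption.

```python
import random

def smoothed_labels(n, bounds, rng):
    """Draw n soft targets uniformly between the two given bounds."""
    low, high = bounds
    return [rng.uniform(low, high) for _ in range(n)]

rng = random.Random(0)  # fixed seed so that the sketch is reproducible

# Placeholder bounds (assumptions, not the values used in the paper).
VALID_BOUNDS = (0.8, 1.0)
FAKE_BOUNDS = (0.0, 0.2)

valid_targets = smoothed_labels(4, VALID_BOUNDS, rng)
fake_targets = smoothed_labels(4, FAKE_BOUNDS, rng)
print(valid_targets)
print(fake_targets)
```

Replacing hard 0/1 targets with noisy ones keeps the discriminator from becoming overconfident, which is the usual motivation for this trick.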

M2H-GAN. The generator is identical to that of the GAN baseline. The discriminator also includes a semi-supervised classification task. Consequently, its output layer is made of nine neurons: one for each of the eight themes of the DECODA framework, plus the FAKE label.

Both GAN and M2H-GAN generators are trained to minimize the binary cross-entropy loss observed with the discriminator predictions on their fake generated features, while their discriminators are trained to correctly classify the sources, with the binary (GAN) and traditional categorical (M2H-GAN) cross-entropy loss functions. Finally, models are trained in an adversarial manner as proposed in [11] during epochs with SGD, no momentum, and with a learning rate set to .

4.4 Results

Two baseline DNN classifiers (Section 4.3) are tested on both the ASR and TRS versions of the DECODA corpus. Then, GAN-based approaches are trained following Algorithm 1. All the experiments are performed times and averaged, to alleviate variations due to the random initialization of the parameters.

Table 2 reports the average accuracies observed with the GAN and the more adapted M2H-GAN approaches, compared to simple DNN classifiers on the DECODA task. It is first important to note the difference in terms of accuracy between the two baselines on ASR and TRS transcripts. Indeed, while the standard deviation remains almost equal, both real (w.r.t. the validation set) and max test accuracies differ. More precisely, the ASR-based DNN obtains a real test accuracy of 83.4% compared to 88.0% for the TRS-based DNN, representing a drop of 4.6%. This is easily explained by the high WER observed on the ASR transcriptions (Section 4.1), which significantly alters the LDA representation and the final classification performance. These results support the initial intuition that a translation of ASR documents to TRS-like representations allows us to better identify the most related theme of a spoken dialogue.

Models     Data   Dev.   Real Test   Max Test   Std. Dev.
DNN        TRS    92.5        88.0       88.5       0.016
DNN        ASR    89.5        83.4       84.6       0.017
GAN        ASR    87.0        84.1       85.2       0.012
M2H-GAN    ASR    90.0        85.5       85.8       0.007
Table 2: Accuracies (%) obtained by various models on the DECODA corpus. “Real Test” stands for the performances observed on the test set w.r.t. the validation set, while “Max Test” are the best results obtained. Results are averaged over runs. The standard deviation is computed over these runs and concerns the “Real Test” performances.

As a first step to reduce this gap, ASR transcript inputs are mapped to the TRS ones with a GAN. This approach obtains a best test accuracy of % for ASR inputs, reducing the absolute difference with TRS performances to %. The standard deviation is also lowered to 0.012, resulting in a slightly more stable model. Validation performance, however, degrades, with an average accuracy of 87.0% compared to 89.5% and 92.5% for the DNNs trained on the ASR and TRS transcripts respectively.

The Machine-to-Human mapping is then performed with the M2H-GAN. The real test accuracy is increased to 85.5%, representing an absolute gain of 1.4% and 2.1% compared to the simpler GAN and the DNN classifier respectively. The gap between the ASR classification performance and the TRS one is also reduced to 2.5%. It is also worth underlining that the standard deviation is more than halved (0.007) in comparison with all the other models, resulting in a more robust representation of the spoken document content.
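The gains and gaps discussed here follow directly from the “Real Test” column of Table 2, as the short check below shows. The dictionary keys and the helper name are chosen for the example.

```python
# "Real Test" accuracies (%) as reported in Table 2.
real_test = {"DNN-TRS": 88.0, "DNN-ASR": 83.4, "GAN": 84.1, "M2H-GAN": 85.5}

def gap(a, b):
    """Absolute difference in accuracy points, rounded to one decimal."""
    return round(real_test[a] - real_test[b], 1)

print("TRS vs. ASR baseline:", gap("DNN-TRS", "DNN-ASR"))
print("M2H-GAN gain over GAN:", gap("M2H-GAN", "GAN"))
print("M2H-GAN gain over DNN:", gap("M2H-GAN", "DNN-ASR"))
print("Remaining gap to TRS:", gap("DNN-TRS", "M2H-GAN"))
```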

Models      Data   Dev.   Real Test   Max Test   Std. Dev.
AE [10]     ASR    -           81         -          -
DAE [10]    ASR    -           -          74.3       -
DSAE [10]   ASR    88.0        82.0       83.0       -
QDAE [30]   ASR    90.0        85.2       85.2       -
GAN         ASR    87.0        84.1       85.2       0.012
M2H-GAN     ASR    90.0        85.5       85.8       0.007
Table 3: Accuracies (%) obtained by the proposed generative models, compared to previous works on the DECODA corpus. “Real Test” stands for the performances observed on the test set w.r.t. the validation set, while “Max Test” are the best results obtained. Results are averaged over runs. The standard deviation is computed over these runs and concerns the “Real Test” performances.

Table 3 shows the results observed with the GAN and M2H-GAN models compared to previously investigated generative models. Both GAN and M2H-GAN outperform the auto-encoders (AE), denoising auto-encoders (DAE), and deep stacked auto-encoders (DSAE) proposed in [10, 9]. Indeed, the encoder and decoder of an auto-encoder are trained jointly to minimize the reconstruction error, whereas the generator and discriminator are trained on different objectives that influence each other. M2H-GAN also gives better results than the recent quaternion-valued denoising auto-encoder (QDAE), despite the fact that the QDAE is based on a better document representation and a specific segmentation with the quaternion algebra.

5 Conclusions

Summary. This paper proposes to use generative adversarial networks to map an automatically transcribed telephone conversation to a latent representation of its “clean” human transcription, so that it can be better classified by neural networks. The proposed M2H-GAN, derived from semi-supervised GANs, is compared to common DNN classifiers and a GAN architecture on a realistic task of theme identification of telephone conversations. The M2H-GAN raises the classification accuracy of the noisy ASR transcripts from 83.4% for a straightforward DNN to 85.5%. The absolute difference with manually transcribed document classification is therefore lowered to 2.5%. The model is also more robust, with a halved standard deviation over the runs.

Future work. Generative adversarial networks are a recent research area and have been investigated less than traditional methods. Therefore, future work will consist in investigating dedicated GAN models that better consider the structure of the documents, such as speech turns with recurrent models. Moreover, the instability of GAN training must be investigated in the specific context of noisy documents.