Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

05/17/2020 ∙ by Won Ik Cho, et al. ∙ NAVER Corp. ∙ Seoul National University

Speech is one of the most effective means of communication and is full of information that helps the transmission of utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, phoneme or word posterior probability has frequently been discarded in understanding the natural language. Thus, some recent spoken language understanding (SLU) modules have utilized an end-to-end structure that preserves the uncertainty information. This further reduces the propagation of speech recognition error and guarantees computational efficiency. We claim that in this process, the speech comprehension can benefit from the inference of massive pre-trained language models (LMs). We transfer the knowledge from a concrete Transformer-based text LM to an SLU module which can face a data shortage, based on recent cross-modal distillation methodologies. We demonstrate the validity of our proposal upon the performance on the Fluent Speech Command dataset. Thereby, we experimentally verify our hypothesis that the knowledge could be shared from the top layer of the LM to a fully speech-based module, in which the abstracted speech is expected to meet the semantic representation.




1 Introduction

Speech and text are two representative media of language. Speech, which is delivered mainly via waveform, can be projected to text with the help of automatic speech recognition (ASR). Text, in contrast, is represented visually in letters and is easily digitized to Unicode. Text is generally deemed more beneficial for language comprehension because it transmits information with less uncertainty.

Despite the semantic representation shared between the two [1], they are treated as data of different modalities, especially in engineering studies. In this regard, in contemporary speech-based natural language understanding (NLU) and slot filling tasks, the main approaches have exploited either an ASR-NLU pipeline [2] or end-to-end speech processing [3, 4, 5]. The former, which is conventional, is partially improvable and explainable, while the latter is in fashion since it can prevent ASR errors from cascading.

Figure 1: A brief architecture of the proposed distillation scheme on an end-to-end SLU module. The diagram on the right side is adopted from [6].

In this paper, we combine the two approaches from a cross-modal viewpoint. Given the original speech, its ground truth script, and the target intent, we transfer knowledge from the inference process of a pre-trained language model (LM) to speech understanding (Figure 1). The core idea is to set a meeting place for the representation from the acoustic data and that from the digitized text, in other words, a point where the phonetic and lexical data coincide in terms of semantics. In this way, we compensate for the roughness of low-level speech processing, while reducing the amplification of ASR errors and benefiting from text-based inference. The contributions of this study are as follows:

  • Leveraging the high-performance inference of a fine-tuned text-based LM for end-to-end spoken language understanding via cross-modal knowledge distillation (KD)

  • Verifying the effect of KD with the performance on a widely used intent identification and slot-filling dataset

  • Suggesting the loss function and KD weight scheduling that can be effective in speech data shortage scenarios

2 Related Work

Comprehending directive utterances in terms of intent arguments has been extensively investigated, for both text and speech input. While systems with either input aim to execute similar tasks, the speech-based ones inevitably require more delicate handling owing to the signal-level features.

2.1 Conventional pipeline

In conventional settings, spoken language understanding (SLU) is divided into ASR and NLU. ASR is a procedure that transcribes speech into text, and in NLU, the resulting text is analyzed to yield the intent arguments [2]. This cascade structure has also been widely used in other spoken language processing tasks, including speech translation [7] and intention understanding [8]. It is known and intuitive that the more accurate the ASR is, the more reliable the final result becomes. However, as [9] pointed out in a recent study on speech translation, three main limitations lie in the pipeline: 1) the time delay of the cascade structure, 2) the amplification of ASR errors, and 3) the parameter redundancy caused by the separation of the modules.

2.2 End-to-end approaches

To cope with the disadvantages above, recent SLU systems perform inference in an end-to-end manner, wrapping up the ASR and NLU processes. Advancing from the early approaches that directly infer the answer from signal-level features [10] or jointly train ASR and NLU components [3], recent ones use word posterior-level [4] or phoneme posterior-level [5] pre-trained modules to deal with the shortage of labeled speech resources. The amount of abstraction differs, but the approaches above share the ultimate goal of correctly inferring the argument, usually via slot-wise intent classification.

2.3 Pre-trained language models

Lately, pre-trained LMs [13, 6] based on a recurrent neural network (RNN) [11] or the Transformer [12] have shown powerful performance over various tasks. Moreover, task-wise training is available by just adding a shallow trainable layer on top of the pre-trained module and undertaking lightweight fine-tuning. However, so far, few end-to-end SLU approaches have taken advantage of them [14], mainly because the inference requires an explicit text-format input, which necessitates an accurate ASR. Consequently, the task turns into a conventional pipeline problem, deterring the cross-modality.

2.4 Knowledge distillation of LMs

Though the above limitation is understandable, it is a significant loss for the whole SLU inference to renounce the comprehensive and verified information processing of pre-trained LMs. Is there an approach by which we can leverage their proven performance? Knowledge distillation (KD) can be one solution [15]. It is widely used for model compression, but its scheme of minimizing the layer-wise difference can be adopted in transfer [16] or cross-modal learning [17] as well. Notably, for the Transformer [12]-based pre-trained LMs that occupy a massive volume, recent model compression work proposed condensation schemes adopting a simple RNN [18] or thinner Transformer layers [19]. In this paper, we inherit them along with the philosophy of cross-modal distillation.

3 Proposed Method

The core content of our proposal is leveraging the pre-trained LM [6] to SLU via cross-modal fine-tuning, where the tuning is executed in the form of distillation [18, 19].

3.1 Motivation

In [1], it is demonstrated in detail how spoken and written language share knowledge in abstracting features. Beyond the lexical features, which merely correspond to a phoneme sequence, written language contains tonal symbols (e.g., pinyin) or punctuation marks, which reflect various prosodic features of speech. Thus, we hypothesized that (1) the integration of both modalities affects a speech-based analysis in a positive way.

Next, we noted that text-level features have been experimentally shown to reach state-of-the-art performance in NLU tasks when combined with a pre-trained LM [6], while speech-oriented models have so far gained little from it. It is natural to expect that (2) speech processing can be boosted by NLU via some form of knowledge sharing.

In summary, taking into account (1) and (2), we aimed to transfer implicit linguistic processing in LMs (that can help understand the spoken language) to an SLU module, without an explicit process of speech-to-text adaptation.

3.2 Materialization

The next step is materializing the architecture. Here we refer to two kinds of key papers, namely cross-modal KD for speech translation [20] and LM compression [18, 19].

Cross-modal KD is an ambiguous term since it is difficult to define what a modality is. Thus, we here regard speech and text as different modalities, though in our task, both lead to the same type of inference (intent understanding). Similar to [20], where a student speech translation model learns from the prediction of a teacher machine translation module, our SLU model takes advantage of the logit inference of a fine-tuned Transformer-based LM.


In this process, we employ the detailed compression procedures of a Transformer LM [18], regarding both the model architecture and the loss functions. In the first phase, a pre-trained LM, e.g., Bidirectional Encoder Representations from Transformers (BERT) [6], is fine-tuned with the ground truth, eventually making up a teacher model (though with a different modality). Subsequently, in the end-to-end SLU training phase, which utilizes a frozen pre-trained acoustic module [21, 4], the loss function is updated with the knowledge distilled from the teacher. Here, knowledge is a loss that represents the difference between the two modules (parameter sets) at the logit layers.

To wrap up, leveraging a pre-trained LM for end-to-end SLU in our approach consists of LM fine-tuning and distillation from the LM to the SLU module.

3.3 Model construction

The final step is constructing the concrete structure of KD, where the teacher pre-trained LM [6] utilizes text input and the student adopts a speech instance [4], while the two share the same type of prediction [18]. In this process, we set rules of thumb to leverage the given structure and training resources as efficiently as possible. Since one of our aims is to make the best of verified ready-made solutions, we integrated the released structures, yielding the following specifications:

  • The backbone student model adopts an ASR pre-trained module [21] and an RNN-based intent classifier [4], which respectively yield a word posterior sequence and slot-wise predictions.

  • For the teacher model, the pre-trained BERT is utilized without additional modification, and the fine-tuning only exploits freely available benchmarks.

  • In addition to the cross-entropy (CE) function that is used as the loss of an end-to-end SLU module, a KD loss is added to the total loss to transfer the influence of the teacher in the student training phase.

In sharing the knowledge, as mentioned above, the guidance is transferred from the upper components of the fine-tuned BERT, i.e., the logit layers, so that the student coincides with the representation that comes from the text input. We believe that, unlike the raw-text-friendly input layers of the LM, the upper layers are the parts where the abstracted textual information best meets the spoken features.

More specifically, the shared knowledge can be represented as a regularization (loss function) that the teacher model gives to the student in the training phase, which leads the student in a desirable direction. The total loss function is as follows:

$$\mathcal{L} = \alpha_t \mathcal{L}_{ce} + \beta_t \mathcal{L}_{kd} \qquad (1)$$

where $t$ is a scheduling factor and $\alpha_t + \beta_t = 1$. $\alpha_t$ and $\beta_t$, here denoted as KD weights, are hyper-parameters that decide the influence of $\mathcal{L}_{ce}$ and $\mathcal{L}_{kd}$ respectively, which can be either fixed or dynamically updated.

In detail, $\mathcal{L}_{ce}$ is a CE between the answer labels and the predicted logits of the SLU component, as in (2), where $f$ is a logit representation and $Y$ is the target label. $\mathcal{L}_{kd}$ is either a mean-squared error (MSE) or a smoothed $L_1$ loss (MAE) between the predicted logits of the SLU component and those of the fine-tuned BERT, adopted based on [18] and [22] respectively. In (3), $D$ determines the type of distance (e.g., MSE, MAE):

$$\mathcal{L}_{ce} = \mathrm{CE}(f_{SLU}, Y) \qquad (2)$$

$$\mathcal{L}_{kd} = D(f_{SLU}, f_{BERT}) \qquad (3)$$
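As a minimal sketch of how the CE term and the KD term could be combined, the following plain-Python functions operate on logits given as lists of floats. The function names (`cross_entropy`, `kd_distance`, `total_loss`), the smooth L1 threshold of 1.0, and the choice of weighting CE by one minus the KD weight are assumptions made for illustration, not details confirmed by the released implementation.

```python
import math

def cross_entropy(logits, target):
    """CE between student logits and an integer target label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def kd_distance(student_logits, teacher_logits, kind="mae"):
    """Distance between student and teacher logits: MSE or smooth L1."""
    diffs = [s - t for s, t in zip(student_logits, teacher_logits)]
    if kind == "mse":
        return sum(d * d for d in diffs) / len(diffs)
    if kind == "mae":
        # Smooth L1: quadratic near zero, linear beyond the assumed threshold 1.0.
        terms = [0.5 * d * d if abs(d) < 1.0 else abs(d) - 0.5 for d in diffs]
        return sum(terms) / len(terms)
    raise ValueError(f"unknown distance: {kind}")

def total_loss(student_logits, teacher_logits, target, kd_weight, kind="mae"):
    """Weighted sum of CE and KD loss; the two weights are assumed to sum to 1."""
    ce = cross_entropy(student_logits, target)
    kd = kd_distance(student_logits, teacher_logits, kind)
    return (1.0 - kd_weight) * ce + kd_weight * kd
```

With `kd_weight = 0` the sketch reduces to plain CE training of the student, and with `kd_weight = 1` the student only imitates the teacher logits.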


In BERT fine-tuning, we adopt two kinds of engineering to investigate the teacher models of diverse performance. For a less accurate one, we build a fully connected (FC) layer on the top of [CLS] representation of BERT [6], while for the stronger model, we set FC layers for all the output representations of BERT and then apply a pooling. We call the former teacher and the latter professor henceforth, considering the difference in training accuracy of both.
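The difference between the two heads can be illustrated with a small sketch over per-token output vectors. Everything here is hypothetical: the helper names, the use of a single shared FC layer for all tokens, and mean pooling are assumptions, since the text does not name the pooling type or state whether the FC weights are shared.

```python
def linear(vec, weight, bias):
    """Fully connected layer: weight holds one row per output class."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def teacher_logits(token_states, weight, bias):
    """'teacher' head: FC layer on the [CLS] state, taken to be position 0."""
    return linear(token_states[0], weight, bias)

def professor_logits(token_states, weight, bias):
    """'professor' head: FC layer on every token state, then pooling.

    Mean pooling is an assumption made for this sketch.
    """
    per_token = [linear(h, weight, bias) for h in token_states]
    n = len(per_token)
    return [sum(col) / n for col in zip(*per_token)]
```

The teacher head sees only the [CLS] summary, whereas the professor head aggregates evidence from every token position before deciding.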

Furthermore, to leverage the teacher and professor models simultaneously, we mix the losses that come from each network to make up a hybrid case, as in (4):

$$\mathcal{L}_{kd} = (1 - \gamma)\,\mathcal{L}_{teacher} + \gamma\,\mathcal{L}_{professor} \qquad (4)$$

where $\gamma = 0$ denotes only teacher and $\gamma = 1$ only professor. For the hybrid case, $\gamma = err$, we apply the batch-wise intent error rate, $err$, inspired by [23]. This implies that the professor model teaches more than the teacher for challenging samples.
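A minimal sketch of the hybrid case, assuming a linear interpolation between the two distillation losses with the batch-wise intent error rate as the mixing factor. The function names are hypothetical, and the interpolation form is inferred from the description (0 gives only teacher, 1 gives only professor).

```python
def batch_error_rate(predictions, targets):
    """Fraction of misclassified intents in the current batch."""
    wrong = sum(1 for p, t in zip(predictions, targets) if p != t)
    return wrong / len(targets)

def hybrid_kd_loss(loss_teacher, loss_professor, mix):
    """mix = 0 uses only the teacher loss, mix = 1 only the professor loss."""
    return (1.0 - mix) * loss_teacher + mix * loss_professor

# Example: a batch with half of the intents wrong weights both sources equally.
mix = batch_error_rate([1, 2, 3, 4], [1, 2, 0, 0])
loss = hybrid_kd_loss(2.0, 4.0, mix)
```

Under this form, a batch that the student classifies poorly shifts the KD loss toward the professor, which matches the statement that the professor teaches more on challenging samples.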

4 Experiment

4.1 Dataset

Following the previous end-to-end SLU papers [4, 5, 24], we use the Fluent Speech Command (FSC) dataset proposed in [4]. It incorporates 30,043 speech utterances annotated with three slots, namely action, object, and location. For example, for “Turn the lamp off.”, we have slots filled as {action: deactivate, object: lamp, location: none}, while “Increase the temperature in the bedroom” fills the location slot.
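The slot structure can be sketched as a small data type. The class and function names are hypothetical, the label strings are illustrative, and counting a prediction as correct only when all three slots match is an assumption about the evaluation rather than something stated in this section.

```python
from typing import NamedTuple

class Intent(NamedTuple):
    action: str
    obj: str       # the "object" slot; renamed to avoid shadowing the builtin
    location: str  # "none" when the utterance names no location

def intent_error_rate(preds, golds):
    """Percentage of utterances whose slot triple is not fully correct."""
    wrong = sum(1 for p, g in zip(preds, golds) if p != g)
    return 100.0 * wrong / len(golds)

gold = [Intent("increase", "heat", "bedroom"), Intent("increase", "heat", "none")]
pred = [Intent("increase", "heat", "bedroom"), Intent("increase", "heat", "kitchen")]
rate = intent_error_rate(pred, gold)  # one of two triples differs in the location slot
```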

We adopt this dataset for three reasons: first, the number of speakers and speech utterances is substantial; second, the corpus incorporates fairly complex query-answer pairs, with 248 phrasings in total and 31 unique intents; and above all, the dataset is publicly available. These qualify the dataset as a benchmark over other speech command datasets such as Google Speech Command (https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html) or ATIS [25]. We arrange the distinguishing features in Table 1, and the full specification can be found in [4].

4.2 Implementation

In our experiment, we referred to three released implementations: (i) a full end-to-end SLU module utilizing FSC (https://github.com/lorenlugosch/end-to-end-SLU/), (ii) a freely available pre-trained BERT-Base (https://github.com/huggingface/transformers), and (iii) a process providing task-specific BERT-to-BiLSTM distillation (https://github.com/pvgladkov/knowledge-distillation). With (i) as a backbone, we distilled the knowledge of (ii) into the RNN encoder-decoder of (i) in the training phase. The overall procedure follows (iii), which performs a (text-only) BERT-to-BiLSTM distillation and reaches a competitive standard (e.g., over ELMo [13]).

4.2.1 Teacher training and baselines

Three types of systems are mainly considered. The first type is pre-trained LMs (BERT) fine-tuned with the ground truth (GT) script, which are the teachers that require an accurate script as input. Teacher training was done with the whole FSC scripts, tokenized via the WordPiece tokenization [26] of BERT-Base with a maximum length of 60. Convergence was reached before 50 epochs for all the teacher models.

For the teachers, if ASR output transcriptions are fed as input, we acquire the systems of the second type: an ASR-NLU pipeline, a common baseline. We did not re-train the ASR module with FSC, and instead used recently distributed Jasper [27] modules, one with high accuracy and the other with relatively lower accuracy, to check how sensitive the systems are to word errors.

The last type of models are speech-based ones: an RNN-based end-to-end model that utilizes word-level posteriors [4] and a phoneme posterior-based model with a permutation language model [5]. Unlike [4], which we also train in our environment, for [5] the reported result was adopted from the original paper, specifically the highest among all the settings. For the test of these models, only speech inputs are utilized.

| Split | Utterances | Speakers |
|-------|------------|----------|
| Train | 23,132 | 77 |
| Valid | 3,118 | 10 |
| Test | 3,793 | 10 |

Slot specification: 3 slots; (6, 14, 4) for (action, object, location)
Unique intents: 31 combinations

Table 1: Main features of the dataset.

4.2.2 The proposed

We compare the above approaches to the proposed scheme. As stated in Section 3.3, the whole process resembles [4], differing only in the total loss $\mathcal{L}$. Three main factors determine $\mathcal{L}$: who teaches, how the loss is calculated, and how much the guidance influences. The first regards the source of distillation, namely teacher and professor. The second concerns the distance $D$, MSE or MAE. The last denotes the scheduling of the weights $\alpha_t$ and $\beta_t$.

On the last topic, where $\alpha_t$ and $\beta_t$ set the KD weight, we perform three weight scheduling strategies regarding the temporal factor.


The first one is the aforementioned $err$, adopted as (5a), which depends upon the training intent error rate per batch. Qualitatively, it lets well-classified samples contribute more to the training, as suggested in [23]. The second is the exponential decay (Exp.), calculated as (5b), where the teacher influence falls exponentially and mechanically depending on the epoch. The last is the triangular scheduling (Tri.), inspired by [28] and defined as (5c) for the epoch $t$ and the maximum number of epochs $T$. A constant factor was multiplied to compensate for the scale of the KD loss compared to CE. Unlike Exp., where the teacher warms up the parameters at the very early phase, in Tri. the student learns by itself at first and the teacher intervenes in the middle.
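The three strategies can be sketched as functions that return the KD weight. The exact expressions of (5a) to (5c) are not reproduced in this text, so the forms below are assumptions chosen only to match the qualitative description: an error-rate weight, a weight that starts high and decays exponentially with the epoch, and a triangle that is zero at the start, peaks in the middle of training, and returns to zero. All function names and the `scale` parameter are hypothetical.

```python
import math

def kd_weight_err(batch_error_rate):
    """err: the KD weight equals the batch-wise training intent error rate."""
    return batch_error_rate

def kd_weight_exp(epoch):
    """Exp.: assumed form exp(-epoch), equal to 1 at epoch 0 and decaying after."""
    return math.exp(-epoch)

def kd_weight_tri(epoch, max_epochs, scale=1.0):
    """Tri.: assumed triangle, zero at both ends and peaking at max_epochs / 2.

    `scale` stands in for the constant that compensates for the KD loss scale.
    """
    return scale * max(0.0, 1.0 - abs(2.0 * epoch - max_epochs) / max_epochs)

# KD weight per epoch for a 10-epoch run under each assumed schedule.
exp_weights = [kd_weight_exp(t) for t in range(10)]
tri_weights = [kd_weight_tri(t, 10) for t in range(10)]
```

Under these assumed shapes, Exp. lets the teacher dominate only the first few epochs, while Tri. leaves the first and last epochs almost entirely to the CE loss.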

4.3 Result and analysis

Error rate (%) by input text type

| Teacher models | GT | Jasper | Jasper |
|----------------|----|--------|--------|
| teacher | 3.74 (Train) / 0.00 (Test) | | |
| professor | 0.19 (Train) / 0.00 (Test) | | |

Table 2: Performance of the teacher models. Jasper denotes the ASR model with high performance (low word error rate).

4.3.1 Teacher performance

Overall, the BERT models perform strongly with the ground truth text (Table 2). Although teacher failed to reach the performance of some end-to-end SLU models in terms of training accuracy, its validation and test accuracy proved impeccable, indicating that text-based systems face less uncertain representations in the training phase. Besides, professor far outperforms teacher in training accuracy. However, since the performance is not sustained in the ASR context, we set a baseline for ASR-NLU with the value borrowed from [5] (Table 3).

4.3.2 Comparison and analysis

The results show that distillation helps when the setting is chosen carefully, reducing the error rate from 1.19 to 1.02 at best (Table 3). Some settings do not display an enhancement, probably due to the sensitivity of the test set, but we matched the performance of BERT-PLM (1.05) [5] in certain cases, namely with teacher and hybrid. Though we could not achieve the current best-known performance, which adopts the structure of ERNIE [29] (0.98), one of ours went slightly beyond BERT-PLM with MAE. Overall, we reduced the error rate of the vanilla SLU model by around 15% via simple distillation.
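The relative figure follows from the two error rates quoted above. As a small worked check (the helper name is hypothetical):

```python
def relative_reduction(baseline, improved):
    """Relative error-rate reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

# Vanilla SLU model (1.19) versus the best distilled model (1.02).
reduction = relative_reduction(1.19, 1.02)
```

This evaluates to roughly 14.3%, which the text rounds to around 15%.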

It is notable that professor does not necessarily provide the best teaching. We also observed that with professor distillation the student took many more epochs to reach fair accuracy in the training phase. In this regard, for data shortage scenario #1, even hybrid (where professor has much influence) failed to converge with the err scheduling that had yielded the best performance. This implies that distillation should act more like guidance than a harsh transfer if the resource is scarce.

The choice of loss function is another aspect we scrutinized in this study, considering the previous research on Speech BERT [22]. It has been empirically shown that MAE can compensate for the different natures of speech and text data. The difference is not significant in the whole-data scenario (Table 3), where overfitting is less probable. However, in data shortage scenarios, adopting MSE failed to guarantee the usefulness of distillation as a helper, inducing degradation or collapse (Table 4). We assume that this is a matter of the boosted scale of the loss, which comes from the different levels of uncertainty of the two modalities and which sometimes appears even with MAE (scenario #2).

4.3.3 Data shortage scenarios and scheduling

Lastly, we checked that the proposed method is also effective when the amount of text data dominates the speech, by restricting the usage of speech-text pairs to 10% and 1% in the training phase (Table 4). Given the identical test set for all the scenarios, the amount of error reduction became more visible as the data decreased. For instance, with teacher, MAE, and Exp., we obtained an error reduction of 0.09 for whole-data, 0.16 for 10%, and 0.44 for 1%.
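A shortage scenario of this kind could be set up as in the sketch below, which draws several random subsets of the training pairs at a given fraction. The function name, the fixed seed, and independent sampling per subset are assumptions; the text only states the fractions and the number of random subsets (10 for the 10% scenario and 20 for the 1% scenario, per Table 4).

```python
import random

def sample_subsets(pair_ids, fraction, n_subsets, seed=0):
    """Draw n_subsets independent random subsets, each holding `fraction` of the pairs."""
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    size = max(1, round(len(pair_ids) * fraction))
    return [rng.sample(pair_ids, size) for _ in range(n_subsets)]

train_ids = list(range(23132))  # training-set size from Table 1
subsets_10 = sample_subsets(train_ids, 0.10, 10)
subsets_1 = sample_subsets(train_ids, 0.01, 20)
```

The test set stays untouched in this setup, so the error rates of all scenarios remain comparable.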

In this phase, the scheduling had more influence than in the whole-data case. At first, we expected that err or Tri. would perform well. However, for both shortage scenarios, exponential decay (Exp.) outperformed the others, given MAE and teacher distillation. This means that an early influence that then fades away can lead the student in a better direction if the resource is not enough (Exp. > err, Tri.). The teaching should be moderate (teacher > hybrid), and the transfer of loss should be restricted in some circumstances (e.g., $\beta_t$ = err in scenario #2) to prevent collapse.

We would like to conclude the analysis by summing up as follows:

  • Cross-modal distillation works, and is more significant in speech data shortage scenarios.

  • Teacher with higher performance does not necessarily teach better, and may impede the convergence if the resource is scarce.

  • Loss function affects the result, but seems to be the matter of scale; instead, scheduling is more crucial given data shortage.

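Test error rate (%)

| Model | Reported & done |
|-------|-----------------|
| ASR-NLU (Reported by [5]) | 9.89 |
| Lugosch et al. [4] | 1.20 / 1.19 (Done) |
| Wang et al. [5] | 1.05 (BERT) / 0.98 (ERNIE) |

| Model | MSE, β = 0.1 | MSE, β = 0.5 | MSE, β = err | MAE, β = err |
|-------|--------------|--------------|--------------|--------------|
| Distill-Teacher (γ = 0) | 1.19 | 1.19 | 1.05 | 1.18 |
| Distill-Professor (γ = 1) | 1.18 | 1.19 | 1.13 | 1.18 |
| Distill-Hybrid (γ = err) | 1.13 | 1.13 | 1.05 | 1.02 |

Table 3: Results of the whole-data scenario.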
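Test error rate (%)

Whole data

| Model | MSE, err | MAE, err | MAE, Exp. | MAE, Tri. |
|-------|----------|----------|-----------|-----------|
| Distill-Teacher (γ = 0) | 1.05 | 1.18 | 1.10 | 1.05 |
| Distill-Professor (γ = 1) | 1.13 | 1.18 | 1.18 | 1.08 |
| Distill-Hybrid (γ = err) | 1.05 | 1.02 | 1.08 | 1.08 |

Data shortage #1: 10% (10 random subsets)

| Model | MSE, err | MAE, err | MAE, Exp. | MAE, Tri. |
|-------|----------|----------|-----------|-----------|
| Lugosch et al. [4] | 2.10 / 2.04 (Done) | | | |
| Distill-Teacher (γ = 0) | 2.32 | 2.00 | 1.88 | 1.98 |
| Distill-Hybrid (γ = err) | 2.06 | × | 2.01 | 1.98 |

Data shortage #2: 1% (20 random subsets)

| Model | MSE, err | MAE, err | MAE, Exp. | MAE, Tri. |
|-------|----------|----------|-----------|-----------|
| Lugosch et al. | 17.22 (Done) | | | |
| Distill-Teacher (γ = 0) | × | × | 16.88 | 17.27 |

Table 4: Distillation influences in the data shortage scenarios with various scheduling schemes. We set [4] as the baseline for the shortage scenarios. × denotes the failure of convergence.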

4.4 Discussion

Despite the feasibility of distillation displayed in the cross-modal context, we think it is beneficial for our argument to discuss the result both from a theoretical and technical perspective.

4.4.1 Knowledge sharing

One may ask whether the distillation is truly a sharing of knowledge, since it can be interpreted as merely supervising the student based on relatively accurate logits. Also, in view of distribution, some outputs regarding confident inferences might be considered as the hard-labeled answer itself. However, in quite a few cases, logits can reflect the extent to which each problem is difficult for the teacher. We believe that such information is intertwined with the word-level posterior, which incorporates the uncertainty of speech processing as well.

4.4.2 Cases not covered here

In the overall passage, we have emphasized the advantages of gathering information from spoken and written language. However, strictly speaking, the corpus we adopted here does not necessarily deal with ambiguous sentences; they are clear directive commands. Instead, acoustic features such as pitch or duration, which are absent in the word- or phoneme-level posterior, might be crucial in determining the sentence meaning in some circumstances, e.g., for syntactically ambiguous utterances [30]. In this regard, we infer that the practical research direction for more speech-oriented analysis should include either a residual connection of acoustic features [31] or prosodic segment embedding that might ameliorate the comprehension [32].

5 Conclusion

In this paper, we materialized speech to text adaptation via an efficient cross-modal LM distillation on an intent classification and slot filling task, FSC. The overall distillation scheme and the implementation details (loss function, scheduling, etc.) for the given scenarios are expected to be practically meaningful for related research. As future work, we plan to dissect the layer-wise information hierarchy of pre-trained LMs that the SLU systems might leverage beyond logit-level representations.

6 Acknowledgements

This research was supported by NAVER Corp. The authors appreciate Hyoungseok Kim, Gichang Lee, and Woomyoung Park for constructive discussion. Also, the authors greatly thank Sang-Woo Lee, Kyoung Tae Doh, and Jung-Woo Ha for helping this research.