
Few-shot learning with attention-based sequence-to-sequence models

by Bertrand Higy, et al.

End-to-end approaches have recently become popular as a means of simplifying the training and deployment of speech recognition systems. However, they often require large amounts of data to perform well on large vocabulary tasks. With the aim of making end-to-end approaches usable by a broader range of researchers, we explore the potential to use end-to-end methods in small vocabulary contexts where smaller datasets may be used. A significant drawback of small-vocabulary systems is the difficulty of expanding the vocabulary beyond the original training samples; we therefore also study strategies to extend the vocabulary with only a few examples per new class (few-shot learning). Our results show that an attention-based encoder-decoder can be competitive against a strong baseline on a small vocabulary keyword classification task, reaching 97.5% accuracy on the test set. It also shows promising results on the few-shot learning problem, where a simple strategy achieved 34.8% accuracy on the new keywords with only 10 examples for each new class. This score goes up to 80.3% with 100 examples.



1 Introduction

Vocal interfaces are becoming more and more popular as our devices (e.g. smartphones, tablets or, more recently, smart speakers) become more intelligent. Speech is an intuitive and effective way to transmit commands, which makes it very appealing. However, the complexity of modern speech recognition technology and the difficulty of gathering the necessary data can make it hard for single individuals or small companies to develop their own systems, even for small vocabulary recognition. This paper explores strategies to train a keyword/command recognition system by trading off the size of the vocabulary against the quantity of data available. In other words, we focus on low resource, small vocabulary tasks, with a strong bias toward simplicity. Our motivation is similar to that behind the Google TensorFlow team’s release of the Speech Commands (SC) dataset [1] and the organization of an accompanying challenge.

Together with the SC dataset, Google released a baseline classification system, which was used as a starting point by many challenge participants. To enable a simple classification system to be used directly without the use of time-warping or other dynamic programming algorithms, every input file in the dataset is constrained to a fixed length, something that would not be required by the more flexible standard HMM-based approaches to speech recognition. Although the fixed-length constraint is not unreasonable for a small vocabulary keyword recognition task, we were motivated to consider a more recent end-to-end (E2E) approach, specifically the attention-based encoder-decoder architecture, as a means of allowing input of arbitrary length whilst retaining the simplicity of a single DNN-based discriminative classifier. This approach also allows us to readily switch between sub-word (phoneme or grapheme) and word modeling by just changing the target output.

In this paper, we experiment with the use of a sequence-to-sequence (S2S) model for a modified version of the Speech Commands task, comparing it with the traditional deep neural network (DNN)-HMM approach. In the literature, S2S models are usually applied to large vocabulary tasks with large datasets, and it is not obvious that they will work well in our setup.

The obvious limitation of the small vocabulary approach we take is that the trained system is confined to the list of commands defined in the original data. To alleviate this constraint, we also explore strategies to extend the set of commands using very few examples (the few-shot learning problem [2, 3, 4]).

The remainder of the paper is organized as follows: relevant literature is presented in section 2, methodology and experiments in sections 3 and 4 respectively, and we conclude in section 5.

2 Related work

E2E training has attracted much attention recently. One of the first breakthroughs came from the connectionist temporal classification (CTC) loss [5], which allows an acoustic neural model to be trained directly on unsegmented data. While the original technique is not E2E, it was later extended to train models that predict grapheme sequences [6] or to work in conjunction with a language model (LM) based on recurrent neural networks (RNNs), an architecture referred to as the RNN-transducer [7]. More recently, the attention-based encoder-decoder model has been applied to automatic speech recognition (ASR) (see e.g. [8, 9]).

Although the simplicity of the training procedure of E2E systems is attractive, they generally perform worse than traditional HMM-based systems, especially when used without an external LM, a good example being [9]. Using a much bigger dataset, [10] managed to reach competitive results on a dictation task, but still performed worse on voice-search data. This does not mean, though, that E2E models will necessarily perform poorly in lower resource conditions. For example, [11] achieved competitive results on several languages, even though it failed to surpass a DNN-HMM baseline. To the best of our knowledge, E2E models have never been applied to small vocabulary speech recognition tasks before. The work closest to ours is probably [12], where an attention-based E2E architecture is applied to keyword spotting. However, despite the vocabulary being reduced to one word, a very large dataset is used.

Among the different E2E approaches, the attention-based encoder-decoder architecture has been shown to give better results [10]. While the original model [13] was proposed for machine translation, several ways to adapt it for speech recognition have since been proposed. A first difference with machine translation resides in the ratio between the length of the input and output sequences: in speech recognition, the input sequence tends to be much longer than the output sequence. [9] proposed to use pyramidal layers to downsample the input. This reduces the number of hidden states the attention has to attend to, thus improving both the accuracy and the computational performance. Similarly, convolutional neural networks (CNNs) have been shown to be effective [14], leading to further improvement. Another concern pertains to the global attention mechanism, which is too flexible for speech recognition (an essentially monotonic left-to-right process). Ways to encourage monotonicity [8] or to ensure the local and monotonic nature of the attention system [15] have thus been proposed. Taking a different approach, a hybrid CTC/Attention architecture trained in a multi-task fashion has been proposed in [16, 17]. The idea there is to use the monotonic, left-to-right properties of CTC to find better alignments, which compensates for the over-flexibility of the attention-based decoder.
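For reference, the multi-task training objective of the hybrid CTC/Attention architecture is usually written as a weighted sum of the two losses. The notation below is an assumption made for illustration ($\lambda$ is the interpolation weight between the CTC loss and the attention-decoder loss); the value of the weight used in our experiments is not stated in this section.

```latex
\mathcal{L}_{\mathrm{MTL}}
  = \lambda \, \mathcal{L}_{\mathrm{CTC}}
  + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}},
  \qquad 0 \le \lambda \le 1
```

With $\lambda = 0$ the model reduces to a pure attention-based encoder-decoder, and with $\lambda = 1$ to a pure CTC model.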

3 Methodology

The main task considered here corresponds to the one proposed in TensorFlow’s Speech Commands challenge mentioned earlier, that is, keyword classification. In addition to the keywords, two additional classes are considered: (i) a _silence_ category corresponding to recordings free of speech, and (ii) an _unknown_ category for recordings containing speech that is none of the keywords.

3.1 CNN-HMM baseline

In previous experiments on the SC task, we obtained our best performance with a DNN-HMM system using CNNs. The CNN-HMM model has been trained using a standard Kaldi recipe (egs/rm/s5/local/nnet/) [18]. It is composed of two 2D (across both time and frequency) convolutional layers followed by 4 fully connected layers. The network is trained with cross-entropy, followed by 1 iteration of discriminative training with the state-level minimum Bayes risk (sMBR) objective [19]. We use as input 11 frames (the current frame plus 5 on each side) of 40 filterbank coefficients, augmented with Δ and ΔΔ features.

3.2 End-to-end model

We opted for the attention-based encoder-decoder approach, and more precisely the hybrid CTC/Attention model from [16, 17], which showed promising results and for which the code was available. We used a CNN-based encoder that is composed of 4 convolutional layers, with 2 max pooling layers (after the second and fourth convolutions). Each pooling layer has a reduction factor of 2, thus downsampling the timescale of the input by 4 overall. Four layers of 320 bidirectional long short-term memory (BiLSTM) units sit on top of the CNN part.

We used the location-aware attention mechanism [20] and a layer of 300 LSTM cells for the decoder. Default hyperparameters from the voxforge recipe were used unless stated otherwise. The input was composed of 80 fbanks and we experimented with 3 different types of labels: phonemes, graphemes and words.
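As a rough illustration of the time downsampling performed by the encoder's two pooling layers, the sketch below computes how many time steps remain before the BiLSTM layers. The function name, the ceiling rounding rule and the 10 ms frame shift (which gives about 100 frames for a 1-second recording) are assumptions made for the example, not details taken from the recipe.

```python
def encoder_output_frames(num_frames, pool_factors=(2, 2), ceil_mode=True):
    """Number of time steps left after successive pooling layers.

    pool_factors: reduction factor of each max pooling layer (two layers of 2 here).
    ceil_mode: assumed rounding rule when the length is not divisible by the factor.
    """
    for factor in pool_factors:
        if ceil_mode:
            num_frames = -(-num_frames // factor)  # ceiling division
        else:
            num_frames = num_frames // factor      # floor division
    return num_frames


# Assumption: 10 ms frame shift, so a 1-second recording gives about 100 frames.
frames_in = 100
frames_out = encoder_output_frames(frames_in)
print(frames_in, "->", frames_out)
```

Under these assumptions the attention mechanism only has to attend to a quarter of the original number of frames.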

3.3 Strategies for few-shot learning

The main limitation of our small vocabulary approach is its lack of flexibility. The ASR system is limited to the set of keywords it was trained on, and there is no guarantee that it will generalize to new ones (in fact we expect it to recognize them poorly, if at all). This limitation is hard to accept in practical usage. To alleviate it, we explore strategies for few-shot learning, where one gathers a few examples of a new word and uses them to retrain or adapt the existing system so that it performs better on this new word.

The simplest strategy we tried consists in adding the examples of the new keywords to the training set from the beginning and training a new model on it (a method referred to as retrain hereafter). One issue with this method is that the new keywords will be under-represented compared to the original ones. To improve on that, we propose oversampling the few-shot examples: these examples are seen $k$ times during an epoch (where $k$ is the oversampling factor), whereas the original examples are seen only once.
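The oversampling idea can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`build_epoch`, the `(utterance_id, label)` tuples and the toy data are hypothetical), not the data loader used in our experiments.

```python
import random


def build_epoch(original, few_shot, k, seed=0):
    """Return one epoch's training list.

    Original examples appear once; each few-shot example appears k times,
    where k is the oversampling factor.
    """
    if k < 1:
        raise ValueError("oversampling factor k must be >= 1")
    epoch = list(original) + list(few_shot) * k
    random.Random(seed).shuffle(epoch)  # shuffle so that repeats are spread out
    return epoch


# Toy data: 20 original examples and 2 few-shot examples of a new keyword.
original = [("org_%d" % i, "yes") for i in range(20)]
few_shot = [("new_%d" % i, "zero") for i in range(2)]
epoch = build_epoch(original, few_shot, k=5)
print(len(epoch))
```

Setting `k=1` recovers the plain retrain strategy without oversampling.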

The main limitation of the retrain strategy is that it requires retraining the model from scratch every time. Alternatively, we propose a method based on adaptation (referred to as adapt) where we start from a model trained on the 12 original categories (see section 3.2). We then adapt all its weights by training it for a few more epochs on the few-shot examples, keeping the same training procedure otherwise. To avoid performance deteriorating on the original keywords, we also include some of their examples, with the same number of examples per class as for the few-shot classes. Since the overall number of examples is very small, few updates are made per epoch. We thus expect higher learning rates to be useful. We also optimized the number of epochs, which plays a complementary role.
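The composition of the adaptation set described above can be sketched as follows. The helper names and the `(utterance_id, label)` representation are hypothetical, and the sampling details (uniform sampling without replacement, fixed seed) are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def sample_per_class(examples, n_per_class, rng):
    """Pick n_per_class examples of every label, uniformly without replacement."""
    by_label = defaultdict(list)
    for utt_id, label in examples:
        by_label[label].append((utt_id, label))
    sampled = []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < n_per_class:
            raise ValueError("not enough examples for class %r" % label)
        sampled.extend(rng.sample(items, n_per_class))
    return sampled


def build_adaptation_set(new_examples, original_examples, n_per_class, seed=0):
    """Few-shot examples plus the same number of examples per original class."""
    rng = random.Random(seed)
    data = sample_per_class(new_examples, n_per_class, rng)
    data += sample_per_class(original_examples, n_per_class, rng)
    rng.shuffle(data)
    return data


# Toy data: two new keywords and two original keywords with 50 examples each.
new_examples = [("n%d" % i, w) for w in ("zero", "one") for i in range(50)]
original_examples = [("o%d" % i, w) for w in ("yes", "no") for i in range(50)]
adapt_set = build_adaptation_set(new_examples, original_examples, n_per_class=10)
class_counts = Counter(label for _, label in adapt_set)
print(class_counts)
```

Keeping the classes balanced in this way is what limits the number of updates per epoch, which is why higher learning rates are expected to help.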

One drawback of this strategy, however, is that the model we start from may not contain all the output labels required for the new keywords (its outputs are limited to the phonemes or graphemes present in the 10 original keywords). We solve this problem by replacing the missing phonemes (resp. graphemes) by the UNK model (resp. the character ?) initially introduced for the _unknown_ category (see section 4.1). This can result in a dramatic change, as exemplified by the word backward, for which most phonemes are absent from the pretrained model’s output. The original transcription "B AE K W ER D" becomes "UNK UNK UNK UNK UNK D" after replacement. Similarly for graphemes, replacing missing characters leads to the transcription "????w?rd". In view of this limitation, and in order to compare the adapt strategy more fairly with the retrain approach, we introduced the retrain_replace strategy. This strategy uses the same training procedure as the retrain method, but with the modified labels.
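The replacement rule can be illustrated with a short Python sketch. The grapheme inventory is derived directly from the spelling of the 10 original keywords; the phoneme inventory relies on assumed CMUdict-style pronunciations that are given here for illustration only and may differ from the dictionary used in our experiments. All function names are hypothetical.

```python
ORG_KEYWORDS = ["down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes"]

# Assumed pronunciations (illustrative only, not taken from the paper).
ASSUMED_PRONUNCIATIONS = {
    "down": "D AW N", "go": "G OW", "left": "L EH F T", "no": "N OW",
    "off": "AO F", "on": "AA N", "right": "R AY T", "stop": "S T AA P",
    "up": "AH P", "yes": "Y EH S",
}


def grapheme_inventory(words):
    """Set of characters seen in the original keywords."""
    return set("".join(words))


def phoneme_inventory(pronunciations):
    """Set of phonemes seen in the original keywords."""
    return {p for pron in pronunciations.values() for p in pron.split()}


def replace_missing_phonemes(transcription, inventory, unk="UNK"):
    """Replace every phoneme absent from the inventory by the UNK model."""
    return " ".join(p if p in inventory else unk for p in transcription.split())


def replace_missing_graphemes(word, inventory, unk="?"):
    """Replace every character absent from the inventory by '?'."""
    return "".join(c if c in inventory else unk for c in word)


phonemes = replace_missing_phonemes(
    "B AE K W ER D", phoneme_inventory(ASSUMED_PRONUNCIATIONS))
graphemes = replace_missing_graphemes("backward", grapheme_inventory(ORG_KEYWORDS))
print(phonemes)
print(graphemes)
```

Under these assumptions the sketch reproduces the two transcriptions of backward quoted in the text.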

4 Experiments

4.1 Experimental setup

Our experimental setup is derived from the second version of the SC dataset [1]. It contains 105,800 recordings of 35 different keywords. Each record has a fixed duration of 1 second. The results in [1] correspond to the task that was proposed for the original challenge, that is the classification of 10 keywords (out of the 35 available), the remaining ones being used to populate the _unknown_ category.

This setup has been slightly modified here. A limitation of the original design is that the same collection of words is used for the _unknown_ category at both training and test time. In order to better evaluate the generalization capability of the model to unseen words (which will inevitably occur in practice, given the small number of words we use to train this category), we decided to exclude some of them from the training set. Also, as mentioned earlier, we are interested in exploring strategies for few-shot learning. Hence, we also kept a few words aside for use in those experiments.

| Set | Words |
|---|---|
| org_kwd | down, go, left, no, off, on, right, stop, up, yes |
| org_unk | bed, bird, cat, dog, happy, house, marvin, sheila, tree, visual, wow |
| new_kwd | backward, forward, four, one, three, two, zero |
| new_unk | eight, five, follow, learn, nine, seven, six |

Table 1: List of keywords assigned to the different sets.

The 10 original keywords (from here on referred to as the org_kwd set of words) are kept identical. The 25 remaining ones, however, are split into two main categories: 7 are used as new keywords in the few-shot experiments (referred to as new_kwd) and 18 as unknowns (the unk set). This latter group is further split into 11 words (org_unk) that are used for training and evaluation, while the remaining 7 (new_unk) are seen at evaluation time only. Table 1 gives the list of words assigned to each category.
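The partition can be written down and checked programmatically. The sketch below is only an illustration: the dictionary and function names are hypothetical, and backward is listed as the seventh new keyword, consistent with its use as a new-keyword example in section 3.3.

```python
SPLITS = {
    "org_kwd": ["down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes"],
    "org_unk": ["bed", "bird", "cat", "dog", "happy", "house", "marvin",
                "sheila", "tree", "visual", "wow"],
    "new_kwd": ["backward", "forward", "four", "one", "three", "two", "zero"],
    "new_unk": ["eight", "five", "follow", "learn", "nine", "seven", "six"],
}


def check_partition(splits):
    """Return per-set sizes after checking that no word appears in two sets."""
    seen = {}
    for name, words in splits.items():
        for word in words:
            if word in seen:
                raise ValueError("%r is in both %s and %s" % (word, seen[word], name))
            seen[word] = name
    return {name: len(words) for name, words in splits.items()}


sizes = check_partition(SPLITS)
unk_size = sizes["org_unk"] + sizes["new_unk"]  # the full unk set
total = sum(sizes.values())
print(sizes, unk_size, total)
```

The sizes match the counts given in the text: 10 original keywords, 7 new keywords and 18 unknown words, for 35 words in total.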

The split of the data into training, validation and test sets was done using the procedure provided in TensorFlow’s example code, with 80%, 10% and 10% for each set respectively. The _unknown_ category being the combination of several keywords, it is over-represented in the dataset. To prevent it from dominating the learning procedure, we downsample it for the training set, randomly selecting a number of examples corresponding to the mean number of examples we have for the keywords (org_kwd). Finally, for all experiments on few-shot learning, we randomly sample $N$ examples of each new class from the training recordings we have kept aside.
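The downsampling of the _unknown_ category can be sketched as follows. The function name, the `(utterance_id, label)` representation and the rounding of the mean (integer division) are assumptions made for the example.

```python
import random
from collections import Counter


def downsample_unknown(examples, keywords, unknown_label="_unknown_", seed=0):
    """Keep as many unknown examples as the mean number of examples per keyword."""
    counts = Counter(label for _, label in examples if label in keywords)
    if not counts:
        raise ValueError("no keyword examples found")
    target = sum(counts.values()) // len(counts)  # mean count, rounded down (assumption)
    unknown = [e for e in examples if e[1] == unknown_label]
    kept = [e for e in examples if e[1] != unknown_label]
    if len(unknown) > target:
        unknown = random.Random(seed).sample(unknown, target)
    return kept + unknown


# Toy data: 4 "yes", 6 "no" (mean of 5 per keyword) and 20 unknown examples.
examples = ([("y%d" % i, "yes") for i in range(4)]
            + [("n%d" % i, "no") for i in range(6)]
            + [("u%d" % i, "_unknown_") for i in range(20)])
balanced = downsample_unknown(examples, keywords={"yes", "no"})
balanced_counts = Counter(label for _, label in balanced)
print(balanced_counts)
```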

In the phoneme-based experiments, we introduce a special phoneme, labelled UNK, to model the words of the _unknown_ category. Similarly, for grapheme-based or word-based experiments, all the words of this category are transcribed with the single character ?. In all conditions, we further map all output not corresponding to one of the keywords or _silence_ to the _unknown_ category.
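The final mapping step can be sketched as below. This is a minimal illustration that assumes the decoded hypothesis has already been converted to a word string (for phoneme output, through the pronunciation dictionary) and that silence is represented by the string `_silence_`; both the function name and these conventions are assumptions.

```python
KEYWORDS = {"down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes"}


def to_class(decoded, keywords=KEYWORDS):
    """Map a decoded hypothesis to one of the keyword, silence or unknown classes."""
    hypothesis = decoded.strip()
    if hypothesis in keywords or hypothesis == "_silence_":
        return hypothesis
    # Anything else (e.g. "?", a partial word, a new word) counts as unknown.
    return "_unknown_"


predictions = [to_class(h) for h in ["yes", "?", "yep", "_silence_", "st?p"]]
print(predictions)
```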

4.2 End-to-end approach for small vocabulary ASR

| Model | org_kwd | unk | new_kwd |
|---|---|---|---|
| Phoneme-based S2S | 96.6 | 62.0 | - |
| Grapheme-based S2S | 96.6 | 57.6 | - |
| Word-based S2S | 96.4 | 59.5 | - |
| retrain-10 | 96.1 | 63.6 | 36.5 |
| retrain_replace-10 | 96.0 | 64.1 | 36.6 |
| adapt-10 | 86.8 | 42.3 | 23.6 |
| retrain-100 | 96.1 | 54.2 | 82.1 |
| retrain_replace-100 | 96.1 | 56.8 | 81.9 |
| adapt-100 | 92.3 | 39.6 | 89.9 |

Table 2: Validation accuracy (%) of the S2S model with different types of outputs (first set of rows, trained on 12 classes only) or with different few-shot learning strategies (second and third sets of rows, trained with 7 additional classes).

We first report results on the original classification task trained on 12 categories, comparing traditional and E2E pipelines. Table 2 summarizes the results obtained with the S2S model for different types of outputs, where we see that the accuracies of the three S2S models on the keywords are very close. Looking at the performance of the same models on the unk set, we see that they all show a big drop in accuracy, as expected with only 11 different words to populate the _unknown_ category for training. It can be noticed, though, that the phoneme-based S2S model generalizes better than its two competitors (62.0% against 57.6% and 59.5%).

Hence, the greater simplicity of the grapheme- or word-based approaches, which don’t require a pronunciation dictionary, can be traded off for better performance. Moreover, it has to be highlighted that in our small vocabulary context, building the pronunciation dictionary is greatly simplified compared to the large vocabulary context.

| Model | org_kwd | unk | new_kwd |
|---|---|---|---|
| CNN-HMM (baseline) | 95.8 | 45.5 | - |
| Phoneme-based S2S | 97.5 | 59.5 | - |
| retrain-10 | 96.9 | 60.3 | 34.8 |
| adapt-10 | 86.6 | 42.5 | 22.6 |
| retrain-100 | 97.0 | 51.8 | 80.3 |
| adapt-100 | 92.4 | 38.1 | 88.6 |

Table 3: Test accuracy (%) of the baseline and the main S2S models. The first set of rows corresponds to models trained on 12 categories only, while the second and third sets of rows correspond to few-shot experiments.

In table 3, we compare the test accuracy of the best E2E model with the CNN-HMM baseline. On the main task, the E2E approach beats the baseline by 1.7% absolute (97.5% against 95.8%). This is very promising as it shows that E2E models are a competitive alternative to more traditional approaches for our task. The results on the unk set show that they also generalize much better, the E2E approach beating the baseline by 14% absolute on this subset (59.5% against 45.5%).

Figure 1: Attention weights produced by the phoneme-based S2S model for two examples of the words "stop" (left) and "yes" (right), with their respective alignment on top.

Finally, we give some insight into the behavior of the S2S models. As can be seen from figure 1, and confirmed by manual inspection, the attention tends to focus on a single portion of each input and doesn’t shift as the output tokens are produced. It appears that the model representation is more akin to word modeling than to the sub-word modeling that is usually observed. With our small vocabulary, the model is apparently able to discriminate between the different keywords with a single "glance" at the data. For example, in the case of the word stop, the model seems to attend to the phoneme AA (left part of figure 1). More surprisingly, in the case of the word yes, the model seems to seek information from a fixed position, even if it falls in the silence preceding the word. A more quantitative analysis would be required to better understand those dynamics, which may be related to the effective window size of the encoder.

4.3 Few-shot learning

For low values of $N$, the number of few-shot examples per new class, the variability introduced by the random selection of the few-shot examples is high. We thus report here the mean accuracy over 3 runs for all experiments on few-shot learning.

We experiment with $N$ = 10 and $N$ = 100. While 100 examples may seem a lot for few-shot learning, it allows us to test how the different strategies behave when the number of examples increases. It is also to be noted that 100 examples is only 2.6% of the number of examples available for the original classes (approximately 3850 on average).

Figure 2: Validation accuracy for the new keywords with the retrain strategy. We compare phoneme- and grapheme-based outputs, for $N$ = 10 or 100. The bars represent the standard deviation over 3 runs.

For the retrain strategy, we tried oversampling the new keywords up to 3000 simulated examples (that is, an oversampling factor $k$ of up to 300 for $N$ = 10 and up to 30 for $N$ = 100), so as to reach a similar frequency in the training and test sets. As figure 2 shows, the retrain strategy gives very variable results on the new_kwd set without oversampling ($k$ = 1). In contrast, higher values of $k$ give better scores and lower variability. For $N$ = 10, the best results are obtained with the grapheme-based model, which achieves 36.5% accuracy. Conversely, for $N$ = 100, the best results are obtained with the phoneme-based model, for an accuracy of 82.1%.

Figure 3: Validation accuracy for the new keywords (plain) and the original keywords (dashed) with the adapt strategy, as a function of the learning rate. Phoneme- and grapheme-based outputs are compared for 10 and 100 few-shot examples ($N$).

Figure 3 shows the accuracy with the adapt strategy on the new_kwd set as a function of the learning rate. For each experiment, the best number of epochs is selected based on the validation set. We see that increasing the learning rate is necessary to get good results, but with too high a value the results deteriorate as we overfit the small training set. The best results are achieved with the phoneme-based output, with learning rates of 5 and 3, respectively, for the two values of $N$. We also report the accuracy on the org_kwd set to show how the original training is progressively undone as the learning rate increases. As one can see, the performance degrades consistently and at some point drops suddenly as the model overfits the adaptation set.

Table 2 summarizes the accuracy of the different strategies on the validation set for the two values of $N$ (10 and 100), with the hyperparameters giving optimal scores on the new_kwd set. A first and surprising observation is that the phoneme/grapheme replacement rules introduced in section 3.3 for the adapt strategy do not seem to penalize the performance on the new_kwd set: the retrain_replace strategy gives results very close to the retrain one overall. Comparing the performance of the adapt and retrain strategies, we see that adaptation is not only much faster to train but also performs best on the new keywords for $N$ = 100 (89.9% against 82.1%). However, this is achieved at the expense of the org_kwd and unk sets, and for $N$ = 10 adaptation performs worse than retraining on all three sets. Table 3 summarizes the test accuracy of the best models for both strategies (retrain and adapt).

5 Conclusion

In this paper, we studied the adequacy of E2E approaches on a small vocabulary task, in order to simplify the process of training a keyword/command recognition system and make this technology more accessible. We found that they can be competitive in such a context, giving better results than a strong CNN-HMM baseline. We also proposed two few-shot strategies for adding new keywords from a handful of examples. By simply training a model from scratch on the combination of the original dataset and those new examples, we managed to reach 34.8% accuracy with only 10 examples per new keyword and 80.3% with 100 examples. A faster adaptation strategy was also proposed, which achieves even better results with 100 examples, reaching 88.6% accuracy, but at the expense of the performance on the original keywords. These results may be further improved by using more advanced strategies.

We have also shown that the dynamics of the hybrid CTC/Attention model in our task are quite different from what is usually observed with large vocabulary tasks. It would be interesting in the future to analyze more deeply the behavior of the model as one moves from a small vocabulary keyword task to a large vocabulary one with complex sentences.

6 Acknowledgements

We would like to thank Ondřej Klejch, Joachim Fainberg and Joanna Rownicka for their help with the SC Dataset and for sharing their code, on which our baseline is based. We would also like to thank Sameer Bansal for the very fruitful discussions on sequence-to-sequence models for ASR.