A Study of Low-Resource Speech Commands Recognition based on Adversarial Reprogramming

10/08/2021
by Hao Yen, et al.

In this study, we propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR) and build an AR-SCR system. The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model (from the source domain). To solve the label mismatch between the source and target domains, and to further improve the stability of AR, we propose a novel similarity-based label mapping technique to align classes. In addition, the transfer learning (TL) technique is combined with the original AR process to improve the model adaptation capability. We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech. Experimental results show that with an acoustic model (AM) pretrained on a large-scale English dataset, the proposed AR-SCR system outperforms the current state-of-the-art results on the Arabic and Lithuanian speech commands datasets, using only a limited amount of training data.

1 Introduction

The aim of spoken command recognition (SCR) is to identify a target command out of a set of predefined candidates, based on an input utterance [1, 2]. Owing to its wide applicability to various domains, such as smart homes [3] or crime detection [4], SCR has long been an important research topic in the speech processing field [5, 6]. Due to recent advances in deep learning (DL) algorithms, the performance of SCR systems has been significantly enhanced [7, 8, 9]. However, a common requirement for building a high-performance DL-based SCR system is a large amount of labeled training data (speech utterances and the corresponding transcriptions of the commands). Such a requirement is not always realizable in real-world scenarios. In fact, it is often desirable to build a SCR system with only a limited amount of training data. Such a scenario is referred to as a low-resource training scenario [10].

Numerous algorithms have been developed to train DL-based systems under low-resource scenarios. A well-known category of approaches is transfer learning (TL) [11], which aims to use a small amount of training data to fine-tune a pretrained model, where the pretrained model is generally trained on a large-scale dataset. Prior studies have demonstrated the effectiveness of TL in speech signal processing tasks. For example, in [12], an English acoustic model (AM) pretrained on a large-scale training set is fine-tuned with limited training data to obtain a Spanish AM. Meanwhile, in [13], a multilingual bottleneck feature extractor is pretrained on a large-scale training set and fine-tuned to form a keyword recognizer on a low-resource Luganda corpus. The study in [14] collected 200 million 2-second audio clips from YouTube to pretrain a speech embedding model that extracts useful features for a downstream keyword spotting task. Although these TL approaches show promising results, the fine-tuning process (which is often carried out online) requires substantial training resources and is thus only feasible for applications where sufficient computational resources are available.

Another category of approaches adopts a pretrained model to extract representative features that facilitate efficient and effective training of SCR systems. The pretrained model is generally trained on a large-scale dataset in either a supervised or a self-supervised manner. In [15] and [16], SCR systems were established by adopting representative features extracted from a pretrained sound event detector and a phone classifier, respectively, both trained on large-scale labeled data. Meanwhile, several methods adopt self-supervised models, such as PASE+ [17, 18], wav2vec [19], and wav2vec 2.0 [20], as feature extractors to build SCR systems [21, 22, 23]. A notable drawback of this category of approaches is that an additional large-scale DL-based model (and accordingly increased hardware) is required.

Figure 1: Illustration of the proposed AR-SCR system. The figure shows how the acoustic signal of a Lithuanian command ("ne") is reprogrammed toward English commands ("nine" and "no") and mapped to its final prediction with a pretrained English acoustic model (AM).

Adversarial reprogramming (AR), as an alternative model adaptation technique, has been confirmed to provide satisfactory results in numerous machine learning tasks [24]. In [25], AR adopts a trainable layer to generate additive noise (i.e., additional information) on input electrocardiography (ECG) signals (target-domain data) to guide a pretrained AM (source domain) to recognize them. Along with the success of [25], our study investigates whether AR can be applied to domain adaptation and accordingly build a SCR system in a low-resource training scenario. Fig. 1 shows the design concept of the proposed AR-SCR system, which consists of a reprogram layer and a pretrained AM. The reprogram layer first generates trainable noises to modify the original signals before passing them into the pretrained AM. The AM then outputs class probabilities corresponding to the source classes. A label mapping technique is adopted to map the probabilities of the source classes to the target classes by aggregating probabilities over the assigned source labels. Based on the aggregated probabilities, the reprogram layer is further trained to generate noises that modify the input signals, so that the pretrained AM can be repurposed to perform recognition in the target task.
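To make this aggregation step concrete, a minimal sketch is given below; it assumes the AM's softmax output and a precomputed many-to-one mapping, and the function and variable names are illustrative rather than taken from the released code.

```python
import torch

def aggregate_probs(source_probs: torch.Tensor, mapping: dict) -> torch.Tensor:
    """Sum source-class probabilities over the classes assigned to each target class.

    source_probs: (batch, n_source_classes) softmax output of the pretrained AM.
    mapping: {target_class_index: [source_class_indices]} many-to-one assignment.
    Returns a (batch, n_target_classes) tensor of aggregated probabilities.
    """
    cols = [source_probs[:, idxs].sum(dim=1) for _, idxs in sorted(mapping.items())]
    return torch.stack(cols, dim=1)
```

For instance, with the mapping illustrated in Fig. 1, the probabilities of the English classes "nine" and "no" would be summed to obtain the score of the Lithuanian target class "ne".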

The proposed AR-SCR system adopts two additional techniques to further improve the model adaptation capability: (1) a novel similarity-based label mapping strategy that aims to align the target and source classes, and (2) a fine-tuning process that adjusts the AM with the AR-generated signals. Experimental results on three low-resource SCR datasets, including the Arabic, Lithuanian, and dysarthric Mandarin speech command datasets, demonstrate that the proposed AR-SCR system can yield better performance than other state-of-the-art methods. In summary, the major contributions of the present work are twofold: 1) This is the first study that investigates the applicability of AR to low-resource SCR tasks, with promising results. 2) We verify that AR has the flexibility to be combined with the TL technique to achieve better model adaptation performance.

Figure 2: Frameworks studied in this work. The "AM" block refers to the acoustic model. In (a), the baseline system is trained from scratch on the target-domain data. In (b), the AM is pretrained and then fine-tuned on the target-domain data. In (c), the AM is pretrained on the source domain and then fixed; a reprogram layer is placed before the pretrained AM to modify the input signals. In (d), we combine AR and TL to train the reprogram layer and fine-tune the AM simultaneously.

2 The Proposed AR-SCR System

2.1 AR-SCR System

Fig. 2 illustrates the overall workflow of a SCR system with the AR and TL model adaptation techniques. In Fig. 2 (a), the AM is directly trained on data from the target-domain task. When the training data from the target domain are scarce, the model cannot be trained well, which may result in unsatisfactory recognition performance. In Fig. 2 (b), the TL technique is applied to the pretrained AM to establish a new SCR system that matches the target domain. In Fig. 2 (c), the pretrained AM is fixed, and a reprogram layer is trained to transform the input signals in order to reduce the distance between the target and source distributions. In Fig. 2 (d), the reprogram layer is treated as a front-end processor, and the TL technique is applied to further fine-tune the AM with the reprogrammed signals. We expect the combination of AR (as front-end processing) and TL (as back-end processing) to achieve better model adaptation capability owing to their complementary abilities.
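As a rough illustration of frameworks (c) and (d), the sketch below trains the reprogram layer on target-domain data and, when enabled, fine-tunes the AM on the reprogrammed signals at the same time; the optimizer, learning rate, epoch count, feature front end, and helper names are assumptions for illustration, not the authors' exact training recipe.

```python
import torch
import torch.nn.functional as F

def train_ar_tl(reprog_layer, acoustic_model, feature_frontend,
                loader, mapping, epochs=50, lr=1e-3, finetune_am=True):
    """Hedged sketch of the AR (+TL) adaptation loop in Fig. 2 (c)/(d).

    reprog_layer: trainable additive-noise layer (see Sec. 2.2, Eq. (2)).
    acoustic_model: pretrained source-domain AM; frozen in (c), fine-tuned in (d).
    feature_frontend: e.g., a log-Mel extractor turning waveforms into AM inputs.
    mapping: {target_class: [source_class_indices]} from the label mapping step.
    All names and hyperparameters here are illustrative assumptions.
    """
    params = list(reprog_layer.parameters())
    if finetune_am:
        params += list(acoustic_model.parameters())    # framework (d): AR + TL
    else:
        acoustic_model.requires_grad_(False)           # framework (c): AR only
    optimizer = torch.optim.Adam(params, lr=lr)

    for _ in range(epochs):
        for waveform, target in loader:                # target-domain batch
            src_logits = acoustic_model(feature_frontend(reprog_layer(waveform)))
            src_probs = src_logits.softmax(dim=-1)
            # Many-to-one label mapping: sum probabilities of assigned source classes.
            tgt_probs = torch.stack(
                [src_probs[:, idxs].sum(dim=1) for _, idxs in sorted(mapping.items())],
                dim=1)
            loss = F.nll_loss(torch.log(tgt_probs + 1e-8), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```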

2.2 Acoustic Signal Reprogramming

The concept of AR was first introduced in [24], where the aim was to determine a trainable input transformation function that repurposes a pretrained model from the source domain to carry out a target task. The authors in [24] showed that, through the AR process, a pretrained ImageNet classification model can solve a square-counting task with high accuracy. A later study [26] demonstrated that a reliable classification system can be established using AR and a black-box pretrained model with scarce data and limited resources. Meanwhile, the Voice2Series method [25] was proposed to transfer a time series $x_t \in \mathbb{R}^{d_t}$ (e.g., ECG or earthquake signals), as the target domain, from the source acoustic domain with input dimension $d_s$, where $d_t \leq d_s$. For these AR approaches, a reprogrammed sample $x_t'$ can be formulated as:

$x_t' = \mathrm{Pad}(x_t) + M \odot \theta$   (1)

where $\mathrm{Pad}(x_t)$ generates a zero-padded time series of dimension $d_s$, the binary mask $M \in \{0,1\}^{d_s}$ indicates the indexes that are not occupied and hence reprogrammable, and $\theta \in \mathbb{R}^{d_s}$ is a set of trainable parameters for aligning the source and target domain data distributions. The term $M \odot \theta$ denotes the trainable additive input transformation for reprogramming. In the original AR method, the target sequence must be shorter than the source counterpart. To overcome this limitation, we design the reprogram layer to add trainable noises to the whole sequence, and thus Eq. (1) becomes

$x_t' = x_t + \theta$   (2)

In this work, we focus on applying AR as a model adaptation technique to effectively fine-tune a pretrained model.
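Concretely, Eq. (2) amounts to learning a single additive noise vector the length of the input waveform. A minimal PyTorch sketch is shown below, assuming fixed-length 1-second, 16 kHz inputs; it is an illustration of the idea rather than the released implementation.

```python
import torch
import torch.nn as nn

class ReprogramLayer(nn.Module):
    """Minimal sketch of the additive reprogramming layer in Eq. (2).

    A trainable noise vector theta, with the same length as the input waveform,
    is added to every incoming utterance before it is fed to the pretrained
    acoustic model. The sequence length is an illustrative assumption.
    """
    def __init__(self, seq_len: int = 16000):  # e.g., 1 s of 16 kHz audio
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(seq_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len) raw waveform from the target domain
        return x + self.theta  # Eq. (2): x' = x + theta
```

In framework (c) of Fig. 2, only this theta is updated while the AM weights stay frozen; in framework (d), both are updated.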

2.3 Pretrained Acoustic Model

In the Voice2Series study [25], the authors compared several well-known AMs as pretrained models and provided the first theoretical justification, based on optimal transport, for reprogramming general time-series signals to acoustic signals. Based on this justification, in this study we establish the AM with two fully convolutional layers, followed by two bidirectional recurrent layers combined with an attention layer. To train the model, we use the Google Speech Commands dataset [27], a large-scale collection of spoken command words containing 105,829 utterances of 35 words from 2,618 speakers; all utterances are recorded at a 16 kHz sampling rate. The pretrained AM has 0.2M parameters and yields a 96.90% recognition accuracy on the testing set of Google Speech Commands.
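For orientation, a hedged sketch of such an attention-based recurrent AM is given below; the layer widths, feature dimension, and attention formulation are illustrative assumptions rather than the exact configuration behind the reported 0.2M-parameter model.

```python
import torch
import torch.nn as nn

class AttentionRNNAM(nn.Module):
    """Sketch of an AM with two convolutional layers, two bidirectional
    recurrent layers, and a simple attention layer over time (Sec. 2.3).
    Layer sizes are illustrative guesses, not the authors' configuration."""
    def __init__(self, n_mels: int = 40, n_classes: int = 35):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=(5, 1), padding=(2, 0)), nn.ReLU(),
            nn.Conv2d(10, 1, kernel_size=(5, 1), padding=(2, 0)), nn.ReLU(),
        )
        self.rnn = nn.GRU(n_mels, 64, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.attn = nn.Linear(128, 1)              # scalar attention score per frame
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) log-Mel features of the (reprogrammed) input
        h = self.conv(mel.unsqueeze(1)).squeeze(1)     # (batch, time, n_mels)
        h, _ = self.rnn(h)                             # (batch, time, 128)
        w = torch.softmax(self.attn(h), dim=1)         # attention weights over time
        context = (w * h).sum(dim=1)                   # (batch, 128)
        return self.classifier(context)                # source-class logits
```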

2.4 Similarity Label Mapping

As illustrated in Fig. 1, a label mapping function is adopted to map the probabilities of the source classes to those of the target classes. The results in [25] show that a many-to-one label mapping strategy (randomly mapping multiple classes from the source task to an arbitrary target class) yields better performance than the one-to-one mapping strategy. In this study, we attempt to improve the random mapping process with a similarity mapping that considers the relationships of data in the source and target domains. In [28], the structural relationships between acoustic scene classes are explored and utilized to address the domain mismatch issue. Inspired by this prior art [28], we propose to investigate the similarity of the labels between the source and target domains and determine the optimal many-to-one label mapping strategy. To compute the similarity, we first feed the source and target data to the pretrained AM, calculate the average representation of each class, and compute the cosine similarity between every source-target pair. Based on these class similarities, each target class is mapped to two or three source classes in our AR-SCR system. Fig. 3 (a) and (b), respectively, show the PCA plots of the representation vectors for the English-Lithuanian and English-Arabic datasets. Interestingly, we can observe that command words with similar acoustic characteristics are mapped to the same target word. For example, in Fig. 3 (a), the English source classes "nine, no, learn" are mapped to the target Lithuanian class "ne", whereas in Fig. 3 (b), the English source classes "right, eight" are mapped to the target Arabic class "Takeed". In our experiments, the AR system with the similarity mapping strategy outperforms the one with random mapping [25], increasing the average testing accuracy by 2.1%, 13.4%, and 9.2% on the Arabic, Lithuanian, and Mandarin datasets, respectively.
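A minimal sketch of this similarity-based assignment is shown below; it assumes per-class representation vectors from the pretrained AM have already been collected, and the variable names, the top-k choice, and the constraint keeping source classes disjoint across targets are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_label_mapping(src_reps: dict, tgt_reps: dict, k: int = 3) -> dict:
    """Map each target class to its k most similar source classes.

    src_reps: {source_label: (n_i, d) tensor of AM representations}
    tgt_reps: {target_label: (m_j, d) tensor of AM representations}
    Returns {target_label: [k most similar source labels]}.
    """
    src_means = {c: reps.mean(dim=0) for c, reps in src_reps.items()}
    mapping, used = {}, set()          # keep source classes disjoint across targets
    for t, reps in tgt_reps.items():
        t_mean = reps.mean(dim=0)
        sims = {c: F.cosine_similarity(t_mean, m, dim=0).item()
                for c, m in src_means.items() if c not in used}
        top = sorted(sims, key=sims.get, reverse=True)[:k]
        mapping[t] = top
        used.update(top)
    return mapping
```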

Figure 3: PCA plots of average representations of several source-target pairs for the (a) English-Lithuanian and (b) English-Arabic datasets. A target class (star point) is mapped to the two or three source classes (circle points) with the highest cosine similarity (marked with the same color).

3 Experiments

3.1 Speech Commands Dataset

As mentioned earlier, we use the Google Speech Commands dataset [27] as the source domain data to pretrain our AM. Three low-resource SCR datasets, including the Arabic, Lithuanian, and dysarthric Mandarin speech command datasets, are used as the target domain tasks.
Arabic Speech Commands: The Arabic speech commands dataset [29] consists of 16 commands, including 6 control words and 10 digits (0 through 9); each command has 100 samples, amounting to 1,600 utterances in total. 40 speakers were involved in preparing the dataset. The speech utterances were recorded at a sampling rate of 48 kHz and downsampled to 16 kHz in our experiments. We follow the setting in [29] and split the dataset into 80% for training and 20% for testing. In addition, we randomly extract 20% of the training data to form a validation set.
Lithuanian Speech Commands: The Lithuanian speech commands dataset [22] was created by translating 20 keywords from Google Speech Commands [27] into Lithuanian. The dataset consists of recordings from 28 speakers, with each speaker pronouncing the 20 words on a mobile phone. We follow the setting in [22] and [30] and choose 15 target classes: 13 command words, 1 unknown word, and 1 silence class. The resulting dataset consists of 326 recordings for training, 75 for validation, and 88 for testing.
Dysarthric Mandarin Speech Commands: The dataset contains 19 Mandarin commands, each uttered 10 times by 3 dysarthric patients [31], recorded at a 16 kHz sampling rate. These 19 commands include 10 action commands and 9 digits, designed to allow dysarthric patients to control web browsers via speech. By removing very long recordings, we select 13 short commands in our experiments; the duration of each command is approximately one second. We follow the setting in [31] and split the whole dataset into 70% for training and 30% for testing.

System      Trainable para.   Average Acc. (%)   Relative improv. (%)
Baseline    200.5k            92.6               –
AR          10k               94.8               2.38
TL          200.5k            98.6               6.48
AR+TL       211.2k            98.9               6.80
LSTM [29]   –                 98.13              –

Table 1: Testing results (average and std of the accuracy scores) of the Arabic speech commands dataset. The state-of-the-art result using LSTM [29] is also reported for comparison.

Limit   System     Average Acc. (%)   Relative improv. (%)
3       Baseline   33.8               –
3       AR         25.1               -25.7
3       TL         52.8               56.2
3       AR+TL      58.9               72.8
10      Baseline   66.2               –
10      AR         34.4               -48.2
10      TL         80.7               21.9
10      AR+TL      81.9               23.7
20      Baseline   70.3               –
20      AR         34.6               -51.4
20      TL         86.8               23.5
20      AR+TL      88.6               26

Table 2: Testing results (average and std of the accuracy scores) of three conditions (3, 10, and 20 training samples) for the Lithuanian speech commands dataset. The results of the state-of-the-art system  [22] are also reported for comparison.

3.2 Experimental Results

For each of the three datasets, we compare the SCR results of the four systems shown in Fig. 2. The baseline system (denoted as Baseline) stands for an AM trained from scratch on the training set of each SCR dataset. The systems with TL and AR model adaptations (Fig. 2 (b) and (c)) are denoted as TL and AR, respectively. The results of combining the AR and TL methods, as in Fig. 2 (d), are denoted as AR+TL. In our preliminary experiments, we tested several different setups. We compared the domains for signal reprogramming, i.e., directly modifying the input waveforms versus modifying the input spectral features. Meanwhile, we evaluated different label mapping techniques, i.e., one-to-one versus many-to-one, as well as random mapping versus similarity mapping. In the following discussions, we only report the results of the best setups. For all the experiments in this study, we tested each SCR system 10 times, and the average accuracy and standard deviation (std) values are reported in each table.

Table 1 lists testing results of the Arabic speech commands dataset. From the table, we first note that the three adaptation systems (AR, TL, and AR+TL) all yield improved accuracy rates with lower std as compared to Baseline. Next, we note that AR+TL can achieve the highest accuracy of 98.9%, which outperforms individual AR (94.8%) and TL (98.6%). Based on our literature survey, AR+TL also outperforms the state-of-the-art SCR system [29] on this Arabic speech commands dataset.

System     Trainable para.   Average Acc. (%)   Relative improv. (%)
Baseline   200.4k            64.0               –
AR         16k               33.2               -48.12
TL         200.4k            78.1               22.03
AR+TL      217.2k            82.3               28.59

Table 3: Testing results (average and std of the accuracy scores) of the dysarthric Mandarin dataset.
Figure 4: The best-1 accuracy values of Baseline, the state-of-the-art wav2vec model [22], and the proposed AR+TL system on the Lithuanian speech commands dataset. We follow the same evaluation metric as [22] and report only the best-1 test accuracy values.

Table 2 lists the testing results of the Lithuanian speech commands dataset. We follow the setting in [22] to conduct experiments using different amounts of training data and report the results of three conditions: 3, 10, and 20 training samples per keyword. From Table 2, we observe that TL consistently outperforms Baseline for the three conditions. Notably, although AR alone cannot attain improved performance, AR+TL yields average accuracies of 58.9%, 81.9%, and 88.6% for the 3, 10, and 20 training-sample conditions, respectively, which are notably better than TL with average accuracies of 52.8%, 80.7%, and 86.7%. Moreover, the std values in Table 2 show that AR can improve performance stability, both when working alone and when combined with the TL technique. Fig. 4 also compares our best AR+TL system with the state-of-the-art SCR system [22], where the best-1 accuracy results are reported for the 3, 10, and 20 conditions. From the figure, AR+TL consistently outperforms the state-of-the-art SCR system [22] on the Lithuanian speech commands dataset. Table 3 shows the results of the dysarthric Mandarin dataset. Similar observations to those from Table 2 can be made. First, TL yields improved performance as compared to Baseline. Next, although AR alone cannot provide better results than Baseline, AR+TL achieves the best performance among the four systems. As compared to Baseline, AR+TL yields a notable relative improvement of 28.59% (from 64.0% to 82.3%). Moreover, the low std values suggest that AR can improve the stability of the SCR systems.

3.3 Visualization

In addition to the quantitative results, we qualitatively compare the original and reprogrammed signals in the time and spectral domains, along with their class activation mappings (CAM) [32, 33, 34]. The left and right columns of Fig. 5 show the results of sample utterances from the Lithuanian and dysarthric Mandarin datasets, respectively. In Fig. 5 (a), the original and reprogrammed signals are shown in black and red, respectively. It can be noted that the noise signals generated by the AR layer are added to the target signals (Lithuanian and dysarthric Mandarin data) to repurpose the pretrained AM (trained on English data). The Mel-spectrogram plots in Fig. 5 (b) present the spectral-temporal characteristics of the reprogrammed signals. From Fig. 5 (a) and (b), the noise signals are less noticeable than the target signals in both the time and spectral domains. Furthermore, we selected the first and second convolution layers for CAM visualization and show the results in Fig. 5 (c) and (d), respectively. For the CAM plots, a lighter color indicates a higher activation value. The CAM plots in Fig. 5 (c) and (d) demonstrate that the first convolution layer tends to focus on the target signals, whereas the second convolution layer tends to focus on the additive noise parts. The results suggest that the pretrained SCR model uses information from both the target signals and the additive noise to perform command recognition.
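For readers who want to reproduce similar saliency maps, the sketch below computes a Grad-CAM-style activation map from a chosen convolution layer of a PyTorch model; this is a gradient-weighted variant rather than necessarily the exact CAM procedure of [32, 33, 34], and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, x, class_idx):
    """Grad-CAM-style saliency for a single input x (batch of size 1).

    conv_layer is the convolution module whose activations are inspected.
    Hedged sketch, not the authors' visualization code.
    """
    feats, grads = [], []
    h1 = conv_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(x)[0, class_idx]        # logit of the class to explain
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    fmap, grad = feats[0], grads[0]               # (1, C, T, F) activations / gradients
    weights = grad.mean(dim=(2, 3), keepdim=True) # per-channel importance
    cam = F.relu((weights * fmap).sum(dim=1))     # (1, T, F) saliency map
    return cam / (cam.max() + 1e-8)               # normalize to [0, 1]
```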

(a) Target (black) and reprogrammed (red) time series
(b) Mel-spectrogram of the reprogrammed input
(c) Class activation mapping of (b) from the first convolution layer
(d) Class activation mapping of (b) from the second convolution layer
Figure 5: Visualization of utterance samples from the Lithuanian (left column) and dysarthric Mandarin (right column) datasets using reprogramming. The rows are (a) the original target time series and its reprogrammed pattern; (b) the Mel-spectrogram of the reprogrammed input; and (c)/(d) its neural saliency maps via class activation mapping from the first/second convolution layer.

4 Conclusion

In this paper, we introduce an AR approach to establish a SCR system with a very limited amount of training data. Experimental results first demonstrate that, on the Arabic dataset, AR can effectively improve the accuracy over the baseline system with few trainable parameters. The results also show that the proposed AR-SCR system can yield better performance than state-of-the-art SCR methods on the Arabic and Lithuanian speech command datasets [29, 22]. In the future, we will explore further improving the AR-SCR performance by combining data augmentation and self-supervised learning methods. Meanwhile, we will investigate applying the proposed AR approach to improve model safety [35, 36] for both classification and regression tasks. The code used in this study will be released at https://github.com/dodohow1011/SpeechAdvReprogram.

Acknowledgement

The authors thank Prof. Chin-Hui Lee at the Georgia Institute of Technology for his valuable comments and support. This work is partially supported by AS Grants AS-CDA-106-M04 and AS-GC-109-05.

References