Deep Reinforcement Learning with Pre-training for Time-efficient Training of Automatic Speech Recognition

05/21/2020 ∙ by Thejan Rajapakshe, et al. ∙ University of Southern Queensland

Deep reinforcement learning (deep RL) combines deep learning with reinforcement learning principles to create efficient methods that learn by interacting with their environment. This has led to breakthroughs in many complex tasks that were previously difficult to solve, such as playing the game "Go". However, deep RL requires significant training time, making it difficult to use in various real-life applications such as Human-Computer Interaction (HCI). In this paper, we study pre-training in deep RL to reduce the training time and improve the performance of speech recognition, a popular application of HCI. To evaluate the performance improvement in training, we use the publicly available "Speech Command" dataset, which contains utterances of 30 command keywords spoken by 2,618 speakers. Results show that pre-training with deep RL offers faster convergence compared to non-pre-trained RL while achieving improved speech recognition accuracy.




1 Introduction

Reinforcement Learning (RL) follows the principle of behaviourist psychology and learns in much the same way as a child learns to perform a new task. RL has been repeatedly successful in the past [1, 2]; however, the successes were mostly limited to low-dimensional problems. In recent years, deep learning has significantly advanced the field of RL, with the use of deep learning algorithms within RL giving rise to the field of “deep reinforcement learning”. Deep learning enables RL to operate in high-dimensional state and action spaces, so that it can now be used for complex decision-making problems [3].

Deep RL algorithms have been applied to video and image processing domains ranging from video games [4, 5] to indoor navigation [6]. Very few studies have explored the promising aspects of deep RL in the field of audio processing, particularly in speech processing [7]. In this paper, we focus on this under-researched topic. Specifically, we conduct a case study of the feasibility of deep RL for automatic speech command classification.

A major challenge of deep RL is that it often requires a prohibitively large amount of training time and data to reach a reasonable accuracy, making it inapplicable in real-world settings [8]. Leveraging humans to provide demonstrations (known as learning from demonstration (LfD)) in RL has recently gained traction as a possible way of speeding up deep RL [9, 10, 11]. In LfD, actions demonstrated by the human are considered as the ground truth labels for a given input game/image frame. An agent closely simulates the demonstrator’s policy at the start, and later on, learns to surpass the demonstrator [8]. However, LfD holds a distinct challenge, in the sense that it often requires the agent to acquire skills from only a few demonstrations and interactions due to the time and expense of acquiring them [12]. Therefore, LfDs are generally not scalable, especially for high-dimensional problems.

Pre-training the underlying deep neural network is another approach to speed up training in deep RL. It enables the RL agent to learn better features, which leads to better performance without changing the policy learning strategies [8]. In supervised methods, pre-training helps regularisation and enables faster convergence compared to randomly initialised networks [13]. Various studies (e.g., [14, 15]) have explored pre-training in speech recognition and achieved improved results. However, pre-training in deep RL has hardly been explored in the area of speech recognition. In this paper, we propose a deep RL framework for speech recognition and evaluate the performance of pre-training to reduce the training time.

2 Related Work

Deep RL often requires prohibitively large amounts of training time and data to achieve reasonable performance, which makes it unsuitable for real-world applications. Pre-training in deep RL is useful to speed up the training process and to reduce the need for large amounts of data [16]. The authors of [17] use sparse variational dropout regularisation for pre-training RL and show that pre-training allows an RL algorithm to learn optimal policies for high-dimensional continuous control problems in a practical time frame. In [18], the authors combine Deep Belief Networks (DBNs) with RL to take advantage of the unsupervised pre-training phase in DBNs and then use the DBN as the starting point for a neural network function approximator. The authors of [19] demonstrate that a pre-trained hidden layer architecture can reduce the time required to solve RL problems. While these studies show the promise of pre-training in deep RL, they are not in audio domains. The feasibility of pre-training RL for audio is not yet well understood. In this paper, we investigate the usability of pre-training of deep RL for speech recognition.

Although we could not find studies using pre-training of RL for audio, some studies have used pre-training in speech research for Deep Learning (DL) models. Thomas et al. [14] utilised pre-training for Deep Neural Networks (DNNs) and achieved excellent speech recognition results using only 1 hour of transcribed training data. Some studies (e.g., [20]) also achieved promising results for cross-lingual acoustic data using pre-training in deep neural networks. In contrast to these studies, we use pre-training for speech-based systems in a deep RL setting.

3 Methodology

Feature learning and policy learning are the two main sub-tasks of deep RL [21]. To investigate pre-training in a deep RL setting, we propose a model for speech command recognition, whose details are explained below.

3.1 Pre-Training

Understanding the impact of pre-training on the performance of RL is the primary aim of this study. Using the Speech Command dataset, we trained a conventional supervised DNN model, and its parameters were used to initialise the policy network (see Section 3.3) of the deep RL agent. We refer to this process as pre-training. Pre-training helps the model to converge quickly and helps improve the accuracy of inference on unseen data during RL execution.
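The weight hand-over described above can be sketched as a layer-by-layer copy. This is only an illustration: the helper `init_policy_from_pretrained`, the layer names, and the tiny weight vectors are hypothetical and stand in for the real supervised model and policy network.

```python
import copy

def init_policy_from_pretrained(pretrained_weights, policy_weights):
    """Copy every layer whose name and size match; leave the rest untouched."""
    initialised = []
    for name, values in pretrained_weights.items():
        if name in policy_weights and len(policy_weights[name]) == len(values):
            # Deep copy, so later RL updates cannot alter the supervised model.
            policy_weights[name] = copy.deepcopy(values)
            initialised.append(name)
    return initialised

# Hypothetical layer names and weights, for illustration only.
supervised = {"conv1": [0.2, -0.1], "lstm": [0.05, 0.3, -0.4], "softmax": [0.7, -0.7]}
policy = {"conv1": [0.0, 0.0], "lstm": [0.0, 0.0, 0.0], "softmax": [0.0, 0.0]}
copied = init_policy_from_pretrained(supervised, policy)
```

Because the supervised model and the policy network share one architecture, every layer is expected to match; the size check is only a guard for the sketch.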

3.2 Deep Reinforcement Learning Framework

Figure 1: Framework of the proposed Deep RL.

The reinforcement learning framework consists of two major entities, namely the “agent” and the “environment”. The action decided by the agent is executed on the environment, which returns the reward and the next state to the agent. In this work, we focus on deep RL, which involves a DNN structure in the agent module to decide the action to take by observing the state, as illustrated in Figure 1. We modelled this problem as a Markov decision process (MDP) [22]. This can be considered as a tuple $(S, A, P, R)$, where $S$ is the state space, $A$ is the action space, $P$ is the state transition policy, and $R$ is the reward function. Since the core goal of this problem is classification, we modelled the MDP in such a way that the predicted classes are the actions, $a$, and the states, $s$, are the features of each audio segment in a batch of size $N$. An action decision is carried out by an RL agent, which receives a reward ($r$) using the following reward function:

[Equation 1]

where $y$ is the ground truth value of the specific speech utterance. We modelled the probability of actions using the following equation:

[Equation 2]

where $i$ is the class index of the maximum probability, $y$ is the ground truth value of the specific speech utterance, and $W$ and $b$ are the weight and bias values. $h$ is the output from the previous hidden layer.

The target of the RL agent is to maximise the expected return using the following policy:

[Equation 3]

where $\pi$ is the policy of the agent and $R(s)$ is the expected reward return at state $s$. To update the policy, we utilise the policy network. Details on the policy network are presented next.
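The MDP formulation of this section (states are utterance features, actions are predicted classes, and the reward depends on the ground truth) can be sketched as a small environment. The class `SpeechCommandEnv`, the +1/-1 reward values, and the rule that the next state is simply another randomly drawn utterance are assumptions made for illustration, not details confirmed by the text.

```python
import random

class SpeechCommandEnv:
    """Toy classification MDP: states are feature vectors, actions are class indices."""

    def __init__(self, features, labels, steps_per_episode=50, seed=0):
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels
        self.steps_per_episode = steps_per_episode
        self.rng = random.Random(seed)
        self.t = 0
        self.index = 0

    def reset(self):
        """Start an episode and return the first state."""
        self.t = 0
        self.index = self.rng.randrange(len(self.features))
        return self.features[self.index]

    def step(self, action):
        """Score `action` against the ground truth, then draw a new random utterance."""
        # ASSUMPTION: +1 for the correct class, -1 otherwise.
        reward = 1.0 if action == self.labels[self.index] else -1.0
        self.t += 1
        done = self.t >= self.steps_per_episode
        # ASSUMPTION: the next state does not depend on the action taken.
        self.index = self.rng.randrange(len(self.features))
        return self.features[self.index], reward, done

env = SpeechCommandEnv([[0.1, 0.9], [0.8, 0.2]], [1, 0], steps_per_episode=3)
state = env.reset()
```

Returning `(next_state, reward, done)` from `step` mirrors the agent-environment exchange in Figure 1: the agent sends an action and receives the reward and the next state.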

3.3 Policy Network

The policy network model consists of a speech command recognition model, as shown in Figure 2.

Figure 2: Speech command recognition model architecture.

The policy network learns to generate a definite output for a particular input in an RL algorithm. In this work, the policy network takes speech features as the input state and recognises the spoken command. For this, we use a deep network consisting of convolutional (CNN) and Long Short-Term Memory (LSTM) layers. Our choice of CNN-LSTM is motivated by its ability to learn both temporal and frequency components of speech signals. An LSTM cell in recurrent neural networks (RNNs) is a memory unit for learning the temporal structure of sequential data [23], and CNNs are effective at reducing frequency variations [24]. We place the CNN layers before an LSTM RNN layer, so that the convolutional feature maps feed the LSTM. The outputs from the LSTM RNN layer are then passed on to fully connected layers to learn discriminative features during training [25]. In this way, our proposed policy network is empowered by convolutional layers for learning high-level abstraction, an LSTM RNN layer to capture long-term temporal context, and finally fully connected layers for learning discriminative representation.

3.3.1 Trainable Model

Figure 3: Trainable model architecture.

To calculate the accuracy, we created a separate network by stacking the loss function on top of the output of the policy network and the “target model”, as shown in Figure 3. The target model has the same architecture as the policy network, and it updates its weights from the policy network once every 500 episodes. This target model is used to infer the target values.

3.3.2 REINFORCE Algorithm

The REINFORCE algorithm is used to approximate the gradient to maximise the objective function mentioned in Equation 3.
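For orientation, the textbook form of the REINFORCE gradient estimator is sketched below. The symbols $\theta$ (policy parameters), $G_t$ (return from step $t$), $\gamma$ (discount factor), $T$ (episode length), and $\alpha$ (step size) are introduced here for illustration; the exact objective used in this paper may differ in form.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T-1}
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right],
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k,
\qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
```

In words: actions that were followed by a high return have their log-probability pushed up, and actions followed by a low or negative return have it pushed down.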

initialise state space S;
initialise policy network model;
pre-train policy network;
retrieve initial state s;
for e = 1 to E do
      initialise d = false and empty Rewards, States, Actions;
      while !d do
            a = get action(s);
            (s', r, d) = execute action(a);
            append r to Rewards;
            append s to States;
            append a to Actions;
            s = s';
      end while
      train policy network(History<Rewards,States,Actions>);
end for
Algorithm 1 REINFORCE algorithm implementation

Input: History<Rewards,States,Actions>
Q' = predict from the target model (States);
Q = policy network output (States);
y = Reward + Discounted Reward;
L = lambda: clipped error(y, Q);
Gradient descent on inputs = States, output = y with loss function L;
Algorithm 2 Training the policy network

Algorithm 1 describes the algorithmic steps followed throughout the RL action prediction process, where $E$ indicates the maximum number of episodes to run (10,000 in our experiments). At the beginning of each episode, a subset of the initial dataset (N=50) is selected randomly as the state space $S$. $s_t$ is the state at instant $t$, $a_t$ is the predicted action for $s_t$ at instant $t$, $r_t$ is the reward obtained by executing the predicted action $a_t$, and $d$ is a boolean flag indicating the end of an episode, which is reached when $t$ reaches the step size (50). Rewards, States, and Actions are arrays collecting the values of $r_t$, $s_t$, and $a_t$ for each step; they are consumed by the policy model’s training method described in Algorithm 2. Training is carried out at the end of each episode, and the “target model” updates its weights from the policy network after every 200 episodes.
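The episode loop can be illustrated with a deliberately small, self-contained sketch. It is not the authors' implementation: a single linear softmax layer stands in for the CNN-LSTM policy network, the reward is assumed to be +1 for a correct class and -1 otherwise, and the update is plain REINFORCE on the immediate reward, so the target model and the discounted-reward term of Algorithm 2 are left out. All names (`LinearSoftmaxPolicy`, `run_episode`) and the toy dataset are hypothetical.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

class LinearSoftmaxPolicy:
    """Toy stand-in for the policy network: one linear layer followed by softmax."""

    def __init__(self, n_features, n_classes):
        self.weights = [[0.0] * n_features for _ in range(n_classes)]

    def probabilities(self, state):
        return softmax([sum(w * x for w, x in zip(row, state)) for row in self.weights])

    def sample_action(self, state, rng):
        probs = self.probabilities(state)
        return rng.choices(range(len(probs)), weights=probs)[0]

    def greedy_action(self, state):
        probs = self.probabilities(state)
        return probs.index(max(probs))

    def reinforce_update(self, states, actions, rewards, lr=0.1):
        """Gradient ascent on the sum over steps of r_t * log pi(a_t | s_t)."""
        for state, action, reward in zip(states, actions, rewards):
            probs = self.probabilities(state)
            for k, row in enumerate(self.weights):
                # d log pi(a|s) / d logit_k = 1[k == a] - pi(k|s)
                grad_logit = (1.0 if k == action else 0.0) - probs[k]
                for j, x in enumerate(state):
                    row[j] += lr * reward * grad_logit * x

def run_episode(policy, dataset, rng, steps=50):
    """One pass of the inner loop of Algorithm 1; returns the episode accuracy."""
    subset = rng.sample(dataset, steps)  # random subset used as this episode's states
    states, actions, rewards = [], [], []
    t, done = 0, False
    while not done:
        state, label = subset[t]
        action = policy.sample_action(state, rng)
        reward = 1.0 if action == label else -1.0  # ASSUMPTION: +1 / -1 reward
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        t += 1
        done = t >= steps
    policy.reinforce_update(states, actions, rewards)  # train at the end of the episode
    return sum(1 for r in rewards if r > 0) / steps

rng = random.Random(0)
# Two perfectly separable toy classes standing in for MFCC features.
dataset = [([1.0, 0.0], 0)] * 100 + [([0.0, 1.0], 1)] * 100
policy = LinearSoftmaxPolicy(n_features=2, n_classes=2)
accuracies = [run_episode(policy, dataset, rng) for _ in range(20)]
greedy_accuracy = sum(policy.greedy_action(s) == y for s, y in dataset) / len(dataset)
```

On this toy problem every update widens the logit gap in favour of the correct class, so the greedy policy classifies both toy classes correctly after training. Starting the weights from a pre-trained model instead of zeros would correspond to the pre-training step of Section 3.1.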

4 Experimental Setup

4.1 Dataset

To evaluate the proposed framework, we used the publicly available Speech Commands Dataset [26], which contains utterances of 30 command keywords spoken by 2,618 speakers. Each utterance is a one-second file with a sampling rate of 16 kHz. The dataset mainly contains two subsets of command keywords, namely “main commands” and “sub commands”. Table 1 shows the distribution of the 30 keywords among the two subsets.

| Subset | Commands |
|---|---|
| Main Commands | one, two, three, four, five, six, seven, eight, nine, down, go, left, no, off, on, right, stop, up, yes, zero |
| Sub Commands | bed, bird, cat, dog, happy, house, Marvin, Sheila, tree, wow |

Table 1: Distribution of keywords in the Speech Commands Dataset

Only 10% of the Speech Commands Dataset was set aside for the pre-training step, and the remaining 90% was used by the RL environment.

4.2 Feature Extraction

We use Mel Frequency Cepstral Coefficients (MFCC) to represent the speech signal. MFCCs are very popular features and widely used in speech and audio analysis [25, 27]. We extract 40 MFCCs from the Mel-spectrograms with a frame length of 2,048 and a hop length of 512 using Librosa [28].
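The number of frames in the MFCC matrix follows from the clip length and the hop length. The sketch below assumes the centred framing that Librosa applies by default, where the signal is padded so that each frame is centred on a multiple of the hop length; `num_frames` is a hypothetical helper, not a Librosa function. Under that assumption, the 87 frames quoted in Section 4.3 correspond to 44,100 samples, that is, a one-second clip at 44.1 kHz, whereas 16,000 samples would give 32 frames. That the audio was resampled is an inference from this arithmetic, not something the text states.

```python
def num_frames(n_samples, hop_length=512, center=True, n_fft=2048):
    """Number of STFT/MFCC frames for a signal of n_samples samples."""
    if center:
        # With centred (padded) framing, every hop position yields one frame.
        return 1 + n_samples // hop_length
    # Without padding, only windows that fit entirely inside the signal count.
    return 1 + (n_samples - n_fft) // hop_length

frames_16k = num_frames(16_000)  # one second at 16 kHz
frames_22k = num_frames(22_050)  # one second at 22.05 kHz
frames_44k = num_frames(44_100)  # one second at 44.1 kHz
```

With 40 coefficients per frame, the resulting feature matrix for one utterance has 40 rows and one column per frame.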

Figure 4: Performance evaluations of the model on three different scenarios
Figure 5: Standard Deviation of the accuracy with the episode on three different scenarios

4.3 Model Recipe

We use the TensorFlow library to implement the policy network, which is a combination of CNN and LSTM. The initial layers are 1D convolution layers wrapped in time-distributed wrappers with filter sizes of 16 and 8, respectively, followed by a max-pooling layer. The feature maps are then passed to an LSTM layer of 50 cells for learning the temporal features. A dropout layer with a dropout rate of 0.3 is used for regularisation. Finally, three fully connected layers of 512, 256, and 64 units, respectively, are added before the softmax layer.

The input to the model is a matrix of size $m \times n$, where $m$ is the number of MFCCs (40) and $n$ is the number of frames (87) in the MFCC spectrum. We use a stochastic gradient descent optimiser with a learning rate of […]. The pre-training steps were carried out with stochastic gradient descent as the optimiser with a learning rate of 0.001. The model was trained for 10 epochs with a batch size of 8 and a 10% validation split.

The “target model” does not update its weights during the training phase; instead, it copies the weights from the policy network after every 200 episodes. The “loss” tensor in the trainable model takes the outputs from the “target model” and the policy network as inputs, then calculates the loss at the end of each episode. This loss is minimised through the Adam optimiser, which adjusts the weights of the policy network towards the optimum.
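The periodic weight copy can be sketched as follows. The class `TargetSync` and the dictionary of weights are hypothetical stand-ins for the real models; only the "copy every 200 episodes" rule comes from the text.

```python
import copy

class TargetSync:
    """Keeps a frozen copy of the policy weights, refreshed every `sync_every` episodes."""

    def __init__(self, policy_weights, sync_every=200):
        self.sync_every = sync_every
        self.target_weights = copy.deepcopy(policy_weights)
        self.n_syncs = 0

    def end_of_episode(self, episode, policy_weights):
        """Call once per finished episode (1-based); returns True when a sync happened."""
        if episode % self.sync_every == 0:
            self.target_weights = copy.deepcopy(policy_weights)
            self.n_syncs += 1
            return True
        return False

policy_weights = {"dense": [0.0, 0.0]}
sync = TargetSync(policy_weights, sync_every=200)
for episode in range(1, 601):
    policy_weights["dense"][0] += 0.01  # stand-in for a training update
    sync.end_of_episode(episode, policy_weights)
```

Between syncs the target weights stay fixed, so the target values used in the loss do not shift with every policy update.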

Accuracy of each episode ($A_e$) is calculated by Equation 4, where $c$ is the number of correct predictions and $N$ is the total number of steps per episode. We use $N = 50$.

$$A_e = \frac{c}{N} \times 100 \qquad (4)$$

5 Results

To benchmark the RL accuracy, we train a DNN with the same model configuration as our policy network. We use 80% of the data for training and 20% for testing. We use stochastic gradient descent as the optimiser, with a learning rate of […] and a batch size of 32. We present the comparison results in Table 2.

Classes Binary 20 Classes 30 Classes
Accuracy (%)
Table 2: Benchmarking results using the model in supervised training. Given are the different number of classes.

Experiments were carried out to identify the impact of pre-training on the training time and accuracy of the RL agent. Three subsets of the Speech Command dataset were selected, namely “binary”, “20 class”, and “30 class”. The binary subset contains only the speech commands “left” and “right”. The 20-class subset contains the “main” commands, and the 30-class subset contains the merge of the “main” and “sub” commands of the “Speech Command” dataset.

We perform experiments using the proposed deep RL model on each subset and report the results in Table 3. Table 3 provides the mean accuracy of the initial 200 episodes with (w/) and without (w/o) pre-training. We observe that, for all classification subsets, non-pre-trained RL attains considerably lower accuracy over the initial 200 episodes, whereas pre-trained RL achieves significantly higher accuracy within the same number of episodes. This shows that pre-training reduces the training time significantly.

Table 3 also shows the mean accuracy of the last 5 episodes after 10,000 episodes for the with (w/) and without (w/o) pre-training scenarios. Pre-trained RL after 10,000 episodes surpasses the benchmark results reported in Table 2 in every experiment. The improvement column shows the increase in accuracy of the “with pre-training” scenario over the “without pre-training” scenario. Each improvement is significant, which further strengthens our finding that pre-training can reduce the training time for deep RL.

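| # Classes | Initial 200 episodes, w/o | Initial 200 episodes, w/ | Improvement | After 10,000 episodes, w/o | After 10,000 episodes, w/ | Improvement |
|---|---|---|---|---|---|---|
| 2 | 60.13 | 81.24 | 21.11 | 80 | 100 | 20 |
| 20 | 7.43 | 52.11 | 44.68 | 25.71 | 87.76 | 62.04 |
| 30 | 6.05 | 41.92 | 35.87 | 26.12 | 79.59 | 53.47 |

Table 3: Accuracy in % with (w/) and without (w/o) pre-training over the initial 200 episodes and after 10,000 episodes. Also shown is the difference (improvement).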
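The improvement figures can be checked by subtracting the accuracy without pre-training from the accuracy with pre-training. The sketch below copies the numbers from Table 3, reading each row as (w/o, w/, improvement); the helper `max_mismatch` is hypothetical. All entries agree exactly except the 20-class result after 10,000 episodes, which differs by 0.01, presumably because the improvement was computed before rounding (an assumption).

```python
# (w/o, w/, reported improvement) per number of classes, copied from Table 3.
initial_200 = {2: (60.13, 81.24, 21.11), 20: (7.43, 52.11, 44.68), 30: (6.05, 41.92, 35.87)}
after_10000 = {2: (80.0, 100.0, 20.0), 20: (25.71, 87.76, 62.04), 30: (26.12, 79.59, 53.47)}

def max_mismatch(table):
    """Largest gap between the reported improvement and (w/ minus w/o)."""
    return max(abs((with_pt - without_pt) - reported)
               for without_pt, with_pt, reported in table.values())

mismatch_initial = max_mismatch(initial_200)
mismatch_final = max_mismatch(after_10000)
```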

To further demonstrate the improvement in training time, the accuracy of the episodes was plotted against the episode number and is presented in Figure 4. One can observe that pre-training has increased the overall accuracy in each of the 3 experiments. Also, within the initial 2,000 episodes, the rate of change of accuracy is higher in all the pre-trained experiments. This implies that pre-training reduces the number of episodes needed to reach a given accuracy, and hence improves efficiency.

A lower standard deviation indicates higher consistency. The standard deviation of the accuracy is plotted against the episode number in Figure 5, and it can be observed that the standard deviation decreases rapidly in all the pre-trained experiments. This suggests that pre-training makes the predictions consistent earlier in training.

6 Conclusions

In this paper, we propose the use of pre-training in deep reinforcement learning for speech recognition. The newly introduced framework uses pre-training for feature learning in a reinforcement learning problem. The feature knowledge learned through pre-training is used by policy learning during RL execution to achieve higher accuracy within a reduced time. We evaluate the proposed RL model using the Speech Command dataset for three different classification scenarios: binary (two different speech commands), 20-class, and 30-class tasks. The results show that pre-training improves the time-efficiency of RL, helping to achieve considerably better results in a significantly smaller number of episodes than RL without pre-training.