Pre-training in Deep Reinforcement Learning for Automatic Speech Recognition

10/24/2019 ∙ by Thejan Rajapakshe, et al. ∙ 0

Deep reinforcement learning (deep RL) combines deep learning with reinforcement learning principles to create efficient methods that learn by interacting with their environment. This has led to breakthroughs in many complex tasks that were previously difficult to solve. However, deep RL requires a large amount of training time, which makes it difficult to use in real-life applications such as human-computer interaction (HCI). Therefore, in this paper, we study pre-training in deep RL to reduce the training time and improve the performance in speech recognition, a popular application of HCI. We achieve significantly improved performance in less time on a publicly available speech command recognition dataset.




1 Introduction

Reinforcement Learning (RL) follows the principle of behaviourist psychology and learns in a similar way as a child learns to perform a new task. RL has been repeatedly successful in the past [1, 2]; however, these successes were mostly limited to low-dimensional problems. In recent years, deep learning has significantly advanced the field of RL, with the use of deep learning algorithms within RL giving rise to the field of “deep reinforcement learning”. Deep learning enables RL to operate in high-dimensional state and action spaces, so that it can now be used for complex decision-making problems [3].

Deep RL algorithms have mostly been applied to video or image processing domains, ranging from playing video games [4, 5] to indoor navigation [6]. Only a very limited number of studies have explored the promising aspects of deep RL in the field of audio processing, in particular for speech processing. In this paper, we study this under-researched topic. Specifically, we conduct a case study of the feasibility of deep RL for automatic speech command classification.

A major challenge of deep RL is that it often requires a prohibitively large amount of training time and data to reach reasonable performance, making it inapplicable in real-world settings [7]. Leveraging humans to provide demonstrations (known as learning from demonstration (LfD)) in RL has recently gained traction as a possible way of speeding up deep RL [8, 9, 10]. In LfD, actions demonstrated by the human are considered as the ground truth labels for a given input game/image frame. An agent closely simulates the demonstrator’s policy at the start, and later on learns to surpass the demonstrator [7]. However, LfD holds a distinct challenge, in the sense that it often requires the agent to acquire skills from only a few demonstrations and interactions due to the time and expense of acquiring them [11]. Therefore, LfD is generally not scalable, especially for high-dimensional problems.

Pre-training the underlying deep neural network (we discuss the structure of RL in detail in Section 2) is another approach to speed up training in deep RL. In [12], the authors combine Deep Belief Networks (DBNs) [latif2018transfer] with RL to benefit from the unsupervised pre-training phase in DBNs, and then use the DBN as the starting point for a neural network function approximator. Furthermore, in [13], the authors demonstrate that a pre-trained hidden layer architecture can reduce the time required to solve reinforcement learning problems. While these studies show the promise of using pre-trained deep neural networks in non-audio domains, the feasibility of pre-training is not well understood for the audio field in general.

We found very few studies in audio using RL/deep RL. In [14], the authors describe an avenue of using RL to classify audio files into several mood classes depending upon listener response during a performance. In [15], the authors introduce the ‘EmoRL’ model, which triggers an emotion classifier as soon as it gains enough confidence while listening to emotional speech. The authors cast this problem as an RL problem by training an emotion classification agent to perform two actions: wait and terminate. The agent selects the terminate action to stop processing incoming speech and classify it based on the observation. The objective was to achieve a trade-off between performance (accuracy) and latency by punishing wrong classification actions as well as overly delayed predictions through the reward function. While the authors of the above studies use RL for audio, they do not focus on pre-training to improve the performance of deep RL.

In this paper, we propose pre-training for improving the performance and speed of Deep RL while conducting speech classification. Results from the case study with the recent public Speech Commands Dataset [16] show that pre-training offers significant improvement in accuracy and helps achieve faster convergence.

2 Methodology

In this study, we investigate the feasibility of pre-training in RL algorithms for speech recognition. We present the details of the proposed model in this section.

2.1 Speech Command Recognition Model

The considered policy network model consists of a speech command recognition model, as shown in Figure 1.

Figure 1: Speech command recognition model architecture.

Considering the fact that Convolutional Neural Networks (CNNs) and Long Short Term Memory (LSTM) Recurrent Neural Networks (RNNs) can be combined to improve the performance [17], we assembled CNN layers on top of an LSTM RNN layer. LSTM RNNs are good at learning the temporal structure of a feature map [18, latif2017variational], and CNNs are strong in diminishing frequency variations [19]. The outputs from the LSTM RNN layer are passed on to fully connected layers to learn discriminative features during training [17]. In this way, our proposed policy network is empowered by convolutional layers for learning high-level abstraction, an LSTM layer to capture long-term temporal context, and finally fully connected layers for learning discriminative representation.

2.2 Deep Reinforcement Learning Framework

Figure 2: Framework of the proposed Deep RL.

The reinforcement learning framework mainly consists of two major entities, namely the “agent” and the “environment”. The action decided by the agent is executed on the environment, which in turn notifies the agent of the reward and the next state. In this work, we focus on deep RL, which involves a Deep Neural Network (DNN) structure in the agent module to determine the action to take by observing the state, as illustrated in Figure 2. We modelled this problem as a Markov decision process (MDP) [20]. This can be considered as a tuple $(S, A, P, R)$, where $S$ is the state space, $A$ is the action space, $P$ is the state transition policy, and $R$ is the reward function. Since the core goal of this problem is classification, we modelled the MDP in such a way that the predicted classes are the actions, $a \in A$, and the states, $s \in S$, are the features of each audio segment in a batch of size $n$. An action decision is carried out by an RL agent which receives a reward ($r$) using the following reward function:

$$r = \begin{cases} +1, & \text{if } a = y \\ -1, & \text{otherwise} \end{cases}$$

where $y$ is the ground truth value of the specific speech utterance. We modelled the probability of actions using the following equation:

$$P(a) = \mathrm{softmax}(W h + b)$$

where $P(a)$ is the action selection probability, and $W$ and $b$ are the weight and bias values. $h$ is the output from the previous hidden layer. The softmax function is defined as:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

The target of the RL agent is to maximise the expected return under the policy:

$$J(\pi) = \mathbb{E}_{\pi}\left[ r(s) \right]$$

where $\pi$ is the policy of the agent, and $r(s)$ is the expected reward return at state $s$.
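The action probabilities and the reward can be illustrated with a small, dependency-free sketch. It assumes a reward of +1 for a correct class and -1 otherwise, which is an assumption chosen to be consistent with the score range reported in the results. The helper names (`softmax`, `action_probabilities`, `reward`) and all numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probabilities(h, W, b):
    """softmax(W h + b) for hidden vector h, weight rows W and biases b."""
    logits = [sum(w * v for w, v in zip(row, h)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(logits)


def reward(action, label):
    """Assumed reward: +1 if the predicted class matches the label, else -1."""
    return 1 if action == label else -1


# Toy example: 3 classes, 2 hidden units (all values are made up).
h = [0.5, -1.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
probs = action_probabilities(h, W, b)
# Pick the most probable class as the action.
action = max(range(len(probs)), key=probs.__getitem__)
```

Here the logits are 0.5, -1.0 and -0.5, so the most probable action is class 0.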

2.2.1 REINFORCE Training

The REINFORCE algorithm is used to approximate the gradient of the objective function, i.e. the expected return, in order to maximise it.
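For reference, the standard REINFORCE estimator of the policy gradient can be written as below. The symbols are the usual textbook ones and are assumptions here rather than the paper's own notation: $\pi_\theta$ is the policy with parameters $\theta$, $G_t$ is the return from step $t$, and $T$ is the episode length.

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]$$

In practice the expectation is approximated by averaging over sampled episodes.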

initialise state space $S$;
initialise policy network model;
pre-train policy network;
retrieve initial state $s_0$;
for $episode = 1$ to $N$ do
      initialise $\mathbf{s}$, $\mathbf{a}$, $\mathbf{r}$, $d$;
      while !$d$ do
            $a_t$ = get action($s_t$);
            execute $a_t$ in the environment to obtain $s_{t+1}$, $r_t$, $d$;
            append $s_t$ to $\mathbf{s}$;
            append $a_t$ to $\mathbf{a}$;
            append $r_t$ to $\mathbf{r}$;
      end while
      train policy network on $\mathbf{s}$, $\mathbf{a}$, $\mathbf{r}$;
end for
Algorithm 1 REINFORCE algorithm implementation

Algorithm 1 describes the algorithmic steps followed throughout the RL action prediction process, where $N$ indicates the maximum number of episodes to run (10,000 in our experiments), $s_t$ is the state at instant $t$, $a_t$ is the predicted action for the state at instant $t$, $r_t$ is the reward obtained by executing the predicted action $a_t$, and $d$ is a boolean flag indicating the end of an episode, which is reached when $t$ reaches the step size (50). $\mathbf{s}$, $\mathbf{a}$, and $\mathbf{r}$ are arrays collecting the values of $s_t$, $a_t$, and $r_t$ for each step, which are consumed by the policy model’s training method.

For a given set of examples in the state space $S$, the environment initially sends the state $s_0$ to the RL agent. The RL agent infers the corresponding action probabilities through the policy network, selects the most probable action $a_t$, and returns it to the environment. The environment then calculates the reward for the action-state combination and returns the reward $r_t$ and the next state $s_{t+1}$ to the agent. Each $s_t$, $a_t$, and $r_t$ is stored, and the policy network is then trained with the values collected in an episode.
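The interaction loop described above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `ToyEnv`, `LinearPolicy` and `run` are hypothetical names, a linear softmax policy stands in for the CNN-LSTM network, the reward is assumed to be +1/-1, and the update is a REINFORCE-style step weighted by the immediate reward (reasonable here because the next state does not depend on the action). The sketch samples actions from the policy during training, which is the usual REINFORCE setup, whereas the text describes selecting the most probable action. The pre-training step is omitted because its procedure is not detailed in the text.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


class ToyEnv:
    """Hypothetical environment: states are labelled feature vectors and the
    reward is +1 for the correct class, -1 otherwise (assumed)."""

    def __init__(self, examples, steps, rng):
        self.examples, self.steps, self.rng = examples, steps, rng

    def reset(self):
        self.t = 0
        self.features, self.label = self.rng.choice(self.examples)
        return self.features

    def step(self, action):
        reward = 1 if action == self.label else -1
        self.t += 1
        done = self.t >= self.steps  # episode ends at the step size
        self.features, self.label = self.rng.choice(self.examples)
        return self.features, reward, done


class LinearPolicy:
    """Softmax over W x; a stand-in for the CNN-LSTM policy network."""

    def __init__(self, n_features, n_classes):
        self.W = [[0.0] * n_features for _ in range(n_classes)]

    def probs(self, x):
        return softmax([sum(w * v for w, v in zip(row, x)) for row in self.W])

    def sample(self, x, rng):
        p = self.probs(x)
        return rng.choices(range(len(p)), weights=p)[0]

    def greedy(self, x):
        p = self.probs(x)
        return max(range(len(p)), key=p.__getitem__)

    def train(self, states, actions, rewards, lr=0.1):
        """Ascend sum_t r_t * grad log pi(a_t | s_t) over one episode."""
        grad = [[0.0] * len(row) for row in self.W]
        for x, a, r in zip(states, actions, rewards):
            p = self.probs(x)
            for k in range(len(p)):
                coeff = r * ((1.0 if k == a else 0.0) - p[k])
                for j, v in enumerate(x):
                    grad[k][j] += coeff * v
        for k, row in enumerate(grad):
            for j, g in enumerate(row):
                self.W[k][j] += lr * g


def run(num_episodes=20, steps=50, seed=0):
    rng = random.Random(seed)
    examples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]  # made-up (features, label)
    env = ToyEnv(examples, steps, rng)
    policy = LinearPolicy(n_features=2, n_classes=2)
    scores = []
    for _ in range(num_episodes):
        states, actions, rewards = [], [], []
        state, done = env.reset(), False
        while not done:
            action = policy.sample(state, rng)
            next_state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = next_state
        policy.train(states, actions, rewards)  # one update per episode
        scores.append(sum(rewards))  # episode score = sum of rewards
    return policy, scores
```

Calling `run()` returns the trained toy policy and the per-episode scores, each of which lies between -50 and 50 for 50 steps per episode.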

3 Experimental Setup

3.1 Datasets

The Speech Commands Dataset [16] is an audio corpus of 105,829 utterances containing 30 command keywords spoken by 2,618 speakers. Each utterance is stored as a one-second file in the ‘.wav’ format with a 16 kHz sampling rate. The dataset mainly contains two subsets of command keywords, namely “Main Commands” and “Sub Commands”. Table 1 shows the distribution of the 30 keywords among the two subsets.

Subset | Commands
--- | ---
Main Commands | one, two, three, four, five, six, seven, eight, nine, down, go, left, no, off, on, right, stop, up, yes, zero
Sub Commands | bed, bird, cat, dog, happy, house, Marvin, Sheila, tree, wow

Table 1: Distribution of keywords in the Speech Commands Dataset

3.2 Feature Extraction

In this study, we use Mel Frequency Cepstral Coefficients (MFCC) to represent the speech signal. MFCCs are very popular features and widely used in speech and audio analysis/processing [17, 21]. We extract 40 MFCCs from the mel-spectrograms with a frame length of 2,048 and a hop length of 512 using Librosa [22].
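As a rough check on the resulting feature shape, the number of frames can be estimated from the signal length and the hop length. The sketch below assumes centred framing, where the frame count is 1 + floor(samples / hop), as in Librosa's default short-time Fourier transform. The function name and the sample counts are illustrative assumptions.

```python
def num_frames(num_samples, hop_length=512):
    """Frame count under centred framing: 1 + floor(num_samples / hop_length)."""
    return 1 + num_samples // hop_length


frames_16k = num_frames(16_000)  # one second of audio at 16 kHz
frames_44k = num_frames(44_100)  # one second of audio at 44.1 kHz
```

Under this assumption, one second at 16 kHz gives 32 frames, while the 87 frames reported in the model recipe correspond to about 44,100 samples. This suggests the audio may have been resampled at load time, which is an inference rather than something the text states.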

3.3 Model Recipe

We use the TensorFlow library to implement the model, a combination of CNN and LSTM. The initial layers are 1D convolution layers wrapped in time-distributed wrappers with filter sizes of 16 and 8, respectively, followed by a max-pooling layer. The feature maps are then passed to an LSTM layer of 50 cells for learning the temporal features. A dropout layer with a dropout rate of 0.3 is used for regularisation. Finally, three fully connected layers of 512, 256, and 64 units, respectively, are added before the softmax layer. The input to the model is a matrix of size $N \times M$, where $N$ is the number of MFCCs (40), and $M$ is the number of frames (87) in the MFCC spectrum. We use a stochastic gradient descent optimiser with a learning rate of


The score value is defined as the sum of the rewards produced within the episode. The score can be used to infer the overall accuracy of the RL agent within a given episode.
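One way to relate the score to accuracy is sketched below. It assumes a reward of +1 for each correct prediction and -1 for each wrong one, which is consistent with the reported scores lying between -50 and 50 for 50 steps per episode. Under that assumption the score equals the number of correct predictions minus the number of wrong ones. The function name is hypothetical.

```python
def score_to_accuracy(score, steps=50):
    """Accuracy implied by an episode score under assumed +1/-1 rewards.

    score = correct - wrong and correct + wrong = steps,
    so correct = (score + steps) / 2.
    """
    correct = (score + steps) / 2
    return correct / steps


acc_perfect = score_to_accuracy(50)   # every step correct
acc_chance = score_to_accuracy(0)     # half of the steps correct
acc_worst = score_to_accuracy(-50)    # every step wrong
```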


4 Results

Experiments were carried out focusing on the effect that pre-training has on the accuracy of the RL agent. Three subsets of the Speech Commands Dataset were selected, namely “binary”, “20 class”, and “30 class”. The binary subset contains only the speech commands of the “left” and “right” classes. The 20-class and 30-class subsets contain the “Main” commands and the union of the “Main” and “Sub” commands, respectively. Each subset was evaluated both without and with pre-training.

Figure 3: Performance evaluations of the model on three different scenarios
Figure 4: Standard Deviation of the score with episode on three different scenarios

The mean score of a batch of 200 episodes is plotted against the episode number in Figure 3. The graphs in Figure 3 show that pre-training increases the overall score. The binary classification task nearly reaches its maximum score of 50 within the initial 2,500 episodes. The other two subsets reach a score over 25 within 10,000 episodes.

The rates of change of score (velocity of score) for the initial 500 and 1,000 episodes were calculated by equation 6 and tabulated in Table 2.


The results in Table 2 indicate that, within the initial 1,000 episodes, pre-training increases the velocity of score for all 3 subsets of experiments. This suggests that pre-training can decrease the time taken for the RL agent to converge to better accuracy.

# Classes | Change of velocity of score (%), initial 500 episodes | Change of velocity of score (%), initial 1,000 episodes
--- | --- | ---
2 | -0.9 | 4.4
20 | 15.2 | 8.8
30 | 6.4 | 11.2

Table 2: Velocity of score change in the initial episodes.

# Classes | Score after 10,000 episodes, w/o pre-train | Score after 10,000 episodes, w/ pre-train | Improvement (%)
--- | --- | --- | ---
2 | 29.4 | 49.0 | 19.6
20 | -23.8 | 37.0 | 60.8
30 | -23.4 | 29.0 | 52.4

Table 3: Improvement of the score with pre-training.

Table 3 shows the mean score of the last 5 episodes for the “with” (w/) and “without” (w/o) pre-train scenarios. The improvement column shows the increase in score of the “with pre-train” scenario with respect to the “without pre-train” scenario, where the improvement is calculated by equation 7. All improvements are positive. This indicates that the overall final score (accuracy) of the RL agent policy network model can be improved by pre-training.
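The improvement values in Table 3 can be reproduced as a simple difference between the two score columns. The sketch below assumes that this difference is what the improvement column reports, since the equation itself is not reproduced here. The variable names are illustrative.

```python
# (score without pre-training, score with pre-training) per number of classes,
# copied from Table 3.
table3 = {2: (29.4, 49.0), 20: (-23.8, 37.0), 30: (-23.4, 29.0)}

# Improvement as the difference "with" minus "without", rounded to one decimal.
improvement = {
    n_classes: round(with_pt - without_pt, 1)
    for n_classes, (without_pt, with_pt) in table3.items()
}
```

The computed differences match the improvement column: 19.6, 60.8 and 52.4 for the 2, 20 and 30 class tasks.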


According to Figure 4, the standard deviation of the score decreases rapidly over the episodes in the pre-trained scenario. This again suggests that the predictions of the RL agent become more consistent over time and that pre-training accelerates this process.

5 Conclusions

In this paper, we propose the use of pre-training in deep reinforcement learning (deep RL) for speech recognition. The proposed model uses pre-training knowledge to achieve a better score while reducing the convergence time. We evaluated the proposed RL model using the Speech Commands Dataset for three different classification scenarios: a binary task (two different speech commands), a 20-class task, and a 30-class task. Results show that pre-training helps to achieve considerably better results in a lower number of episodes. In future efforts, we want to study the feasibility of using unrelated data to pre-train deep RL to further improve its performance and convergence.