Reinforcement Learning (RL) follows the principles of behaviourist psychology and learns in a similar way to a child learning to perform a new task. RL has seen repeated success in the past [1, 2]; however, these successes were mostly limited to low-dimensional problems. In recent years, deep learning has significantly advanced the field of RL, with the use of deep learning algorithms within RL giving rise to the field of “deep reinforcement learning”. Deep learning enables RL to operate in high-dimensional state and action spaces, so that RL can now be applied to complex decision-making problems [3].
Deep RL algorithms have mostly been applied to video or image processing domains, ranging from playing video games [4, 5] to indoor navigation [6]. Only a very limited number of studies have explored the promising aspects of deep RL in the field of audio processing, in particular speech processing. In this paper, we study this under-researched topic. Specifically, we conduct a case study of the feasibility of deep RL for automatic speech command classification.
A major challenge of deep RL is that it often requires a prohibitively large amount of training time and data to reach reasonable performance, making it inapplicable in real-world settings. Leveraging humans to provide demonstrations (known as learning from demonstration, or LfD) in RL has recently gained traction as a possible way of speeding up deep RL [8, 9, 10]. In LfD, the actions demonstrated by the human are treated as ground-truth labels for a given input game/image frame. An agent closely imitates the demonstrator’s policy at the start, and later learns to surpass the demonstrator. However, LfD poses a distinct challenge: it often requires the agent to acquire skills from only a few demonstrations and interactions, due to the time and expense of acquiring them. Therefore, LfD generally does not scale, especially to high-dimensional problems.
Pre-training the underlying deep neural network (in Section 2 we discuss the structure of RL in detail) is another approach to speeding up training in deep RL. In [12], the authors combine Deep Belief Networks (DBNs) with RL to benefit from the unsupervised pre-training phase in DBNs, and then use the DBN as the starting point for a neural network function approximator. Furthermore, in [13], the authors demonstrate that a pre-trained hidden-layer architecture can reduce the time required to solve reinforcement learning problems. While these studies show the promise of using pre-trained deep neural networks in non-audio domains, the feasibility of pre-training is not well understood for the audio field in general.
We found very few studies using RL/deep RL in audio. In [14], the authors describe an avenue for using RL to classify audio files into several mood classes depending upon listener responses during a performance. In [15], the authors introduce the ‘EmoRL’ model, which triggers an emotion classifier as soon as it gains enough confidence while listening to emotional speech. The authors cast this problem as an RL problem by training an emotion classification agent to perform two actions: wait and terminate. The agent selects the terminate action to stop processing incoming speech and classify it based on the observation so far. The objective is to achieve a trade-off between performance (accuracy) and latency by punishing wrong classification actions as well as overly delayed predictions through the reward function. While the authors of the above studies use RL for audio, they do not focus on pre-training to improve the performance of deep RL.
In this paper, we propose pre-training to improve the performance and speed of deep RL for speech classification. Results from a case study with the recent public Speech Commands Dataset [16] show that pre-training offers a significant improvement in accuracy and helps achieve faster convergence.
In this study, we investigate the feasibility of pre-training in RL algorithms for speech recognition. We present the details of the proposed model in this section.
2.1 Speech Command Recognition Model
The considered policy network consists of a speech command recognition model, as shown in Figure 1.
2.2 Deep Reinforcement Learning Framework
The reinforcement learning framework mainly consists of two major entities, namely the “agent” and the “environment”. The action decided by the agent is executed on the environment, which notifies the agent of the reward and the next state. In this work, we focus on deep RL, which involves a Deep Neural Network (DNN) structure in the agent module to resolve the action to take by observing the state, as illustrated in Figure 2. We model this problem as a Markov decision process (MDP). This can be considered as a tuple $(S, A, P, R)$, where $S$ is the state space, $A$ is the action space, $P$ is the state transition function, and $R$ is the reward function. Since the core goal of this problem is classification, we model the MDP in such a way that the predicted classes act as the actions, $a_t \in A$, and the states, $s_t \in S$, are the features of each audio segment in a batch of size $B$. An action decision is carried out by an RL agent, which receives a reward $r_t$ computed with the following reward function: $r_t = +1$ if $a_t = y_t$, and $r_t = -1$ otherwise, where $y_t$ is the ground-truth label of the specific speech utterance. We model the probability of actions using the following equation: $p(a_t \mid s_t) = \mathrm{softmax}(W h + b)$, where $p(a_t \mid s_t)$ is the action selection probability, $W$ and $b$ are the weight and bias values, and $h$ is the output of the previous hidden layer. The softmax function is defined as $\mathrm{softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$.
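As a concrete illustration, the action-probability computation, including the softmax, can be sketched with NumPy; the hidden size, number of actions, and random weights below are illustrative assumptions, not the paper’s values.

```python
import numpy as np

def softmax(z):
    # softmax(z_i) = exp(z_i) / sum_j exp(z_j), shifted for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def action_probabilities(h, W, b):
    # p(a|s) = softmax(W h + b), with h the output of the last hidden layer
    return softmax(W @ h + b)

# illustrative shapes: 8 hidden units, 3 actions (made up for this sketch)
rng = np.random.default_rng(0)
p = action_probabilities(rng.normal(size=8), rng.normal(size=(3, 8)), np.zeros(3))
```

The agent would then pick the action with the largest probability via `np.argmax(p)`.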
The target of the RL agent is to maximise the expected return under the policy: $J(\theta) = \mathbb{E}_{\pi_\theta}[R(s)]$, where $\pi_\theta$ is the policy of the agent and $R(s)$ is the expected reward return at state $s$.
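The classification-as-MDP formulation can be sketched as a minimal environment; the interface and the +1/−1 reward values are illustrative assumptions, not the authors’ code.

```python
import numpy as np

class SpeechCommandEnv:
    """Minimal sketch of the classification-as-MDP environment: states are
    per-segment feature arrays, actions are predicted classes, and the reward
    is +1 for a correct prediction and -1 otherwise (assumed values)."""

    def __init__(self, features, labels):
        self.features = features    # one feature array per audio segment
        self.labels = labels        # ground-truth class per segment
        self.t = 0

    def reset(self):
        self.t = 0
        return self.features[self.t]

    def step(self, action):
        reward = 1.0 if action == self.labels[self.t] else -1.0
        self.t += 1
        done = self.t >= len(self.features)
        next_state = None if done else self.features[self.t]
        return next_state, reward, done

# toy usage: two segments with labels 0 and 1; the agent always predicts 0
env = SpeechCommandEnv(np.zeros((2, 40, 87)), labels=[0, 1])
state = env.reset()
_, r1, done1 = env.step(0)   # correct prediction
_, r2, done2 = env.step(0)   # wrong prediction; episode ends
```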
2.2.1 REINFORCE Training
The REINFORCE algorithm is used to approximate the gradient $\nabla_\theta J(\theta)$ that maximises the objective function $J(\theta)$.
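A minimal sketch of the REINFORCE gradient estimate, using a linear-softmax policy as a stand-in for the full policy network (the parameterisation and shapes below are assumptions made for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def reinforce_gradient(W, states, actions, rewards):
    """REINFORCE estimate: grad J(W) ~ sum_t r_t * d log pi(a_t|s_t) / dW,
    for the toy policy pi(a|s) = softmax(W s)."""
    grad = np.zeros_like(W)
    for s, a, r in zip(states, actions, rewards):
        p = softmax(W @ s)
        one_hot = np.zeros(len(p))
        one_hot[a] = 1.0
        # d log softmax(W s)[a] / dW = outer(one_hot(a) - p, s)
        grad += r * np.outer(one_hot - p, s)
    return grad

# one gradient-ascent step on a toy 2-action, 3-feature problem
W = np.zeros((2, 3))
g = reinforce_gradient(W, states=[np.ones(3)], actions=[0], rewards=[1.0])
W = W + 0.1 * g   # ascend: raise the probability of the rewarded action
```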
Algorithm 1 describes the algorithmic steps followed throughout the RL action prediction process, where $E_{max}$ indicates the maximum number of episodes to run (10,000 in our experiments), $s_t$ is the state at instant $t$, $a_t$ is the action predicted for $s_t$ at instant $t$, $r_t$ is the reward obtained by executing the predicted action $a_t$, and $done$ is a boolean flag indicating the end of an episode, where the end of the episode is reached when $t$ reaches the step size (50). $S$, $A$, and $R$ are arrays collecting the values of $s_t$, $a_t$, and $r_t$ for each step, which are consumed by the policy model’s training method.
For a given set of examples in the state space $S$, the environment initially sends the state $s_t$ to the RL agent. The RL agent infers the corresponding action probabilities through the policy network, selects the most probable action $a_t$, and returns it to the environment. The environment then calculates the reward for the action-state combination and responds to the agent with the reward $r_t$ and the next state $s_{t+1}$. Each $s_t$, $a_t$, and $r_t$ is stored, and the policy network is then trained with those values at the end of each episode.
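The interaction loop above can be sketched as follows; the `env`/`policy` interfaces and the toy environment are hypothetical, with only the step limit (50) taken from the text:

```python
def run_episode(env, policy, max_steps=50):
    """Collect (s_t, a_t, r_t) for one episode of at most `max_steps` steps,
    as in Algorithm 1; the arrays then feed the policy's training step."""
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                      # most probable action
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        if done:
            break
        state = next_state
    return states, actions, rewards

# hypothetical toy environment: three states, reward +1 when the action
# equals the state index, -1 otherwise
class ToyEnv:
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        reward = 1.0 if action == self.t else -1.0
        self.t += 1
        return self.t, reward, self.t >= 3

S, A, R = run_episode(ToyEnv(), policy=lambda s: s)
```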
3 Experimental Setup
The Speech Commands Dataset [16] is an audio corpus of 105,829 utterances of 30 command keywords, spoken by 2,618 speakers. Each one-second utterance is stored as a ‘.wav’ file with a 16 kHz sampling rate. The dataset comprises two subsets of command keywords, namely “Main Commands” and “Sub Commands”. Table 1 shows the distribution of the 30 keywords between the two subsets.
Main Commands: one, two, three, four, five, six, seven, eight, nine, down, go, left, no, off, on, right, stop, up, yes, zero
Sub Commands: bed, bird, cat, dog, happy, house, Marvin, Sheila, tree, wow
3.2 Feature Extraction
In this study, we use Mel Frequency Cepstral Coefficients (MFCCs) to represent the speech signal. MFCCs are very popular features, widely used in speech and audio analysis/processing [17, 21]. We extract 40 MFCCs from the mel-spectrograms with a frame length of 2,048 and a hop length of 512 using Librosa [22].
3.3 Model Recipe
We use the TensorFlow library to implement the model, a combination of CNN and LSTM: the initial layers are 1D convolution layers wrapped in time-distributed wrappers with filter sizes of 16 and 8, respectively, followed by a max-pooling layer. The feature maps are then passed to an LSTM layer of 50 cells to learn the temporal features. A dropout layer with a dropout rate of 0.3 is used for regularisation. Finally, three fully connected layers of 512, 256, and 64 units, respectively, are added before the softmax layer. The input to the model is an $n \times t$ matrix, where $n$ is the number of MFCCs (40) and $t$ is the number of frames (87) in the MFCC spectrum. We use a stochastic gradient descent optimiser.
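A sketch of the described CNN+LSTM policy network in Keras; the kernel sizes, pooling size, and activations are assumptions, while the filter counts, LSTM width, dropout rate, and dense-layer widths follow the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_policy_network(n_frames=87, n_mfcc=40, num_classes=30):
    """CNN+LSTM sketch: time-distributed 1D convolutions over the MFCC axis,
    an LSTM over frames, then dense layers and a softmax output."""
    return models.Sequential([
        layers.Input(shape=(n_frames, n_mfcc, 1)),
        layers.TimeDistributed(layers.Conv1D(16, 3, activation='relu')),
        layers.TimeDistributed(layers.Conv1D(8, 3, activation='relu')),
        layers.TimeDistributed(layers.MaxPooling1D(2)),
        layers.TimeDistributed(layers.Flatten()),
        layers.LSTM(50),
        layers.Dropout(0.3),
        layers.Dense(512, activation='relu'),
        layers.Dense(256, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(num_classes, activation='softmax'),
    ])

model = build_policy_network()
```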
The score is defined as the sum of the rewards produced within an episode. The score can be utilised to infer the overall accuracy of the RL agent within a given episode.
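Assuming a per-step reward of $+1$ for a correct prediction and $-1$ otherwise (an assumption made for this sketch), the score of an episode of $T = 50$ steps relates to the accuracy $A$ as:

```latex
\text{score} = n_{\text{correct}} - n_{\text{wrong}}, \qquad
A = \frac{n_{\text{correct}}}{T} = \frac{\text{score} + T}{2T}
```

Under this mapping, the maximum score of 50 corresponds to an accuracy of 1.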
Experiments focused on the effect pre-training has on the accuracy of the RL agent. Three subsets of the Speech Commands dataset were selected, namely “binary”, “20 class”, and “30 class”. The binary subset contains only the speech commands of the “left” and “right” classes. The 20 class and 30 class subsets contain the “Main” commands and the merge of the “Main” and “Sub” commands, respectively. Each subset was evaluated both without and with pre-training.
The mean score of a batch of 200 episodes is plotted against the episode number in Figure 3. Interpreting the graphs in Figure 3, one can see that pre-training increases the overall score. The binary classification nearly reaches its maximum score of 50 within the initial 2,500 episodes. The other two subsets reach a score above 25 within 10,000 episodes.
The results in Table 2 show that pre-training increases the velocity of the score within the initial 1,000 episodes for all three subsets of experiments. This leads to the conclusion that pre-training can decrease the time taken for the RL agent to converge to a better accuracy.
Table 2: Change of velocity of score (%) within the initial 500 and initial 1,000 episodes, by number of classes.
Table 3: Score after 10,000 episodes without (w/o) and with (w/) pre-training, and the improvement (%), by number of classes.
Table 3 shows the mean score of the last 5 episodes for the “with” (w/) and “without” (w/o) pre-training scenarios. The improvement column shows the increase in score of the “with pre-training” scenario with respect to the “without pre-training” scenario, calculated by Equation 7. Every improvement is positive. This shows that the overall final score (accuracy) of the RL agent’s policy network model can be improved by pre-training.
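Equation 7 presumably takes the standard relative-improvement form; under that assumption:

```latex
\text{Improvement}\,(\%) = \frac{S_{w} - S_{w/o}}{S_{w/o}} \times 100
```

where $S_{w}$ and $S_{w/o}$ denote the mean scores with and without pre-training, respectively.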
According to Figure 4, the standard deviation of the score decreases more rapidly over time in the pre-trained scenario. This again shows that the predictions of the RL agent improve with time and that pre-training accelerates the process.
In this paper, we propose the use of pre-training in deep reinforcement learning (deep RL) for speech recognition. The proposed model uses pre-training knowledge to achieve a better score while reducing the convergence time. We evaluated the proposed RL model on the Speech Commands dataset for three different classification scenarios: binary (two different speech commands), 20 class, and 30 class tasks. Results show that pre-training helps achieve considerably better results in a lower number of episodes. In future work, we want to study the feasibility of using unrelated data to pre-train deep RL to further improve its performance and convergence.
-  Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker, “Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system,” Journal of Artificial Intelligence Research, vol. 16, pp. 105–133, 2002.
-  Gerald Tesauro, “Temporal difference learning and TD-Gammon,” Communications of the ACM, vol. 38, no. 3, pp. 58–68, 1995.
-  Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath, “A brief survey of deep reinforcement learning,” arXiv preprint arXiv:1708.05866, 2017.
-  David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in Proceedings ICRA. IEEE, 2017, pp. 3357–3364.
-  Gabriel V Cruz Jr, Yunshu Du, and Matthew E Taylor, “Pre-training neural networks with human demonstrations for deep reinforcement learning,” arXiv preprint arXiv:1709.04083, 2017.
-  Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al., “StarCraft II: A new challenge for reinforcement learning,” arXiv preprint arXiv:1708.04782, 2017.
-  Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al., “Deep q-learning from demonstrations,” in Proceedings AAAI, 2018.
-  Vitaly Kurin, Sebastian Nowozin, Katja Hofmann, Lucas Beyer, and Bastian Leibe, “The Atari grand challenge dataset,” arXiv preprint arXiv:1705.10998, 2017.
-  Sylvain Calinon, Learning from Demonstration (Programming by Demonstration), pp. 1–8, Springer Berlin Heidelberg, Berlin, Heidelberg, 2018.
-  Farnaz Abtahi and Ian Fasel, “Deep belief nets as function approximators for reinforcement learning,” in Proceedings Workshops AAAI, 2011.
-  Charles W Anderson, Minwoo Lee, and Daniel L Elliott, “Faster reinforcement learning after pretraining deep networks to predict state dynamics,” in Proceedings IJCNN. IEEE, 2015, pp. 1–7.
-  Jack Stockholm and Philippe Pasquier, “Reinforcement learning of listener response for mood classification of audio,” in Proceedings ICCSE. IEEE, 2009, vol. 4, pp. 849–853.
-  Egor Lakomkin, Mohammad Ali Zamani, Cornelius Weber, Sven Magg, and Stefan Wermter, “EmoRL: Continuous acoustic emotion classification using deep reinforcement learning,” in Proceedings ICRA. IEEE, 2018, pp. 1–6.
-  Pete Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
-  Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, and Julien Epps, “Direct modelling of speech emotion from raw speech,” arXiv preprint arXiv:1904.03833, 2019.
-  Siddique Latif, Muhammad Usman, Rajib Rana, and Junaid Qadir, “Phonocardiographic sensing using deep learning for abnormal heartbeat detection,” IEEE Sensors Journal, vol. 18, no. 22, pp. 9393–9400, 2018.
-  Rajib Rana, Siddique Latif, Sara Khalifa, and Raja Jurdak, “Multi-task semi-supervised adversarial autoencoding for speech emotion,” arXiv preprint arXiv:1907.06078, 2019.
-  Joel R. Tetreault and Diane J. Litman, “A reinforcement learning approach to evaluating state representations in spoken dialogue systems,” vol. 50, no. 8, pp. 683–696.
-  Steven Davis and Paul Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE transactions on acoustics, speech, and signal processing, vol. 28, no. 4, pp. 357–366, 1980.
-  Brian McFee, Colin Raffel, Dawen Liang, Daniel Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto, “librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th Python in Science Conference, 2015, pp. 18–24.