Deep learning aims to minimize the use of domain knowledge in feature representation. However, the model architecture is often designed to reflect the nature of the domain. For example, in the audio domain, where the input data (i.e., waveforms) are high-dimensional and the information relevant to a given task is spread over time, many effective models use not only Convolutional Neural Networks (CNNs) as local feature extractors but also Recurrent Neural Networks (RNNs) to learn the temporal dependencies on top of the local features. This combination of CNNs and RNNs, called convolutional recurrent neural networks (CRNNs), has been widely used in various audio classification tasks, for example, keyword spotting [zeng2019effective, sercan2017crnn_kws], music auto-tagging [choi2017crnn], and sound event detection [xu2018large, cakir2017crnn_polysound, cakir2018end, moon2018end].
While this approach reflects the nature of the input data in a highly abstract manner, there have been attempts to exploit domain knowledge more directly in designing the model, particularly by leveraging the mechanisms of neural processing in the human brain. One such aspect is that neural signal processing is bidirectional: it has afferent projections, the feedforward pathway from the sensor (e.g., eyes or ears) to the brain, and efferent connections, the feedback pathway from the brain to the sensor. The feedback signals control the sensitivity of the feedforward activations [Lyon:2017]. Inspired by these bidirectional connections in the brain, researchers have proposed new neural network architectures. For example, Shi et al. proposed ShuttleNet, which consists of multiple recurrent modules with a feedback-loop connection. They plugged it into a CRNN framework and achieved superior results in action recognition [Shi2016LearningLD]. Chung et al. introduced Gated Feedback Recurrent Neural Networks (GF-RNNs), which have feedback connections from upper recurrent layers to lower layers (although they did not explicitly mention that the model was bio-inspired, both feedback and gating are concepts in neural processing) [chung2015gatedfeed]. In this line of biologically inspired models, we propose a new CRNN architecture with temporal feedback connections. Unlike the two previous models, the feedback from the recurrent module is connected to the convolutional layers below it and scales their channel-wise activations. The idea of channel-wise scaling was used in squeeze-and-excitation networks (SENets) [hu2018senet] to improve the representational power of the model, but we compute the scaling values from the hidden states of the RNN module at the previous time step. This efferent feedback control of filter outputs in lower layers is conceptually similar to the mechanism of the outer hair cells [Lyon:2017].
We evaluate the proposed temporal feedback CRNN (TF-CRNN) on keyword spotting, which is also known as speech command recognition. In such an audio classification task, where the input is sequential and the output is a single label, a many-to-one RNN is a typical choice for the input/output setup. Alternatively, we can use a many-to-many RNN by replicating the label at every time step, or simply use the one-to-one setting with CNN blocks only. We show that the proposed model consistently outperforms the compared models in the different RNN input/output setups. Furthermore, we investigate the effect of temporal feedback in keyword spotting through a failure analysis. Finally, we visualize the channel-wise excitations in the CNN modules to better understand the operation of the feedback controls in TF-CRNN.
2 Temporal Feedback CRNN
This section introduces the architecture of Temporal Feedback Convolutional Recurrent Networks (TF-CRNN) and the input/output setup for keyword spotting.
2.1 Overall architecture
Figure 1 shows the overall architecture of the proposed TF-CRNN. It is built on SampleCNN, a deeply stacked CNN with very small 1D convolutional filters [lee2017sample, lee2018samplecnn]. SampleCNN was designed to take raw waveforms directly and showed superior performance in audio classification tasks [kim2019comparison]. In order to incorporate SampleCNN into the CRNN framework, we configure the CNN module to take a short segment of the waveform and slide it with 50% overlap at each time step. The segment at each time step is therefore much smaller than the input in the original SampleCNN configurations, and accordingly the CNN module can be less deep. The RNN module takes the output of the last convolutional layer and produces a hidden state at every time step. The hidden state at the previous time step is fed back into the convolutional blocks at the current time step, and then the RNN module produces the next hidden state. The output of the RNN module is connected to a fully connected layer, which makes the prediction.
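The sliding of the CNN input with 50% overlap can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `segment_starts`, the 16 kHz sample rate, and the 1600-sample segment size are assumptions made only for the example.

```python
def segment_starts(num_samples, segment_size):
    """Return the start index of every segment when segments overlap by 50%."""
    hop = segment_size // 2  # 50% overlap: advance by half a segment
    if num_samples < segment_size:
        return []
    return list(range(0, num_samples - segment_size + 1, hop))


SAMPLE_RATE = 16000   # assumed sample rate (samples per second)
SEGMENT_SIZE = 1600   # hypothetical segment size in samples

# One second of audio is cut into overlapping segments, one per time step.
starts = segment_starts(SAMPLE_RATE, SEGMENT_SIZE)
print(len(starts), starts[:3])
```

Each start index corresponds to one time step of the RNN module, so a smaller segment yields more time steps and more frequent feedback.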
2.2 Feature Scaling by Temporal Feedback
We use the feedback connections from the RNN module at the previous time step to scale individual feature activations in the convolutional blocks at the current time step. The idea of channel-wise scaling is borrowed from SENets [hu2018senet]. In SENets, the feature scaling values are learned via a feedforward block separate from the convolutional layer. Specifically, the feedforward block consists of two operations: squeeze and excitation. The squeeze operation takes the global average pooling of the feature maps over time. The feature maps are thereby reduced to channel-wise statistics, one value per channel (or filter), averaged over the input length. The excitation operation takes the channel-wise statistics as input and computes the scaling values, bounded between 0 and 1, through two fully-connected layers. Unlike SENets, TF-CRNN uses the feedback signals from the previous time step to compute the scaling values and uses a single fully-connected layer to match the dimensionality of the features in the convolutional layer. Figure 2 illustrates the process of feature scaling in TF-CRNN. The filter scaling can be regarded as a soft gating mechanism applied to each channel separately to improve the representational power of the network [hu2018senet]. The scaling values (i.e., excitations) tend to be more discriminative in higher layers. For audio classification, however, when different classes of audio have different levels of loudness, the scaling values in the first layer become highly class-specific and thus normalize the low-level feature activations according to the loudness level [kim2019comparison]. In Section 5, we will show that the excitations in TF-CRNN have similar characteristics to those in SENets.
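The feedback scaling described above can be sketched in a few lines of framework-free Python. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and the sigmoid that bounds the scaling values is assumed by analogy with SENets.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def feedback_scales(h_prev, weight, bias):
    """Map the previous hidden state to one scaling value per channel.

    A single fully-connected layer: weight is [channels][hidden],
    bias is [channels]. The sigmoid is an assumption borrowed from SENets.
    """
    return [
        sigmoid(sum(w * h for w, h in zip(row, h_prev)) + b)
        for row, b in zip(weight, bias)
    ]


def scale_channels(features, scales):
    """Multiply every value of channel c (features[c][t]) by scales[c]."""
    return [[s * x for x in channel] for channel, s in zip(features, scales)]


# Hypothetical sizes: 2 hidden units, 3 channels, 2 time samples per channel.
h_prev = [0.0, 0.0]                       # zero-initialized hidden state
weight = [[0.3, -0.1], [0.2, 0.5], [-0.4, 0.1]]
bias = [0.0, 0.0, 0.0]
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

scales = feedback_scales(h_prev, weight, bias)
scaled = scale_channels(features, scales)
```

With a zero hidden state and zero bias, every scaling value is 0.5, so the first time step applies a uniform gain; later time steps receive channel-specific gains that depend on what the RNN has accumulated so far.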
2.3 Input/Output setup
As mentioned in Section 1, we can use either many-to-one or many-to-many RNNs for the keyword spotting task. A many-to-one RNN may lose information from the very beginning of the input because it computes the loss function only once, at the last time step, in training. On the other hand, a many-to-many RNN with repeated output labels cannot appropriately handle the loss in the beginning part. Thus, we compare the two setups in the CRNN framework. Note that, in the test phase, we use the many-to-one setup to predict the label.
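The difference between the two training setups can be made concrete with a small sketch. The function names are hypothetical, and averaging the per-step losses in the many-to-many case is an assumption; the text only states that the label is replicated at every time step.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one prediction, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def many_to_one_loss(step_logits, label):
    """Loss computed only once, on the output at the last time step."""
    return cross_entropy(step_logits[-1], label)


def many_to_many_loss(step_logits, label):
    """Loss with the label replicated at every time step, then averaged."""
    losses = [cross_entropy(logits, label) for logits in step_logits]
    return sum(losses) / len(losses)


# Hypothetical two-class example: uninformative early output, confident last output.
step_logits = [[0.0, 0.0], [5.0, 0.0]]
print(many_to_one_loss(step_logits, 0), many_to_many_loss(step_logits, 0))
```

In this example the many-to-many loss is larger because the first time step is penalized even though little of the utterance has been observed, which illustrates the issue with the beginning part noted above.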
3 Experimental Setup
We use the Speech Commands dataset for keyword spotting [warden2018speechcommands]. The dataset contains 84,843, 9,981, and 11,005 audio examples in the training, validation, and test splits, respectively. All files are one second long and each contains one utterance of a command. The number of commands (or keywords) is 35, which corresponds to the number of output classes.
3.2 Implementation details
The models take raw waveforms directly as input. The one-second audio waveforms are used as input in both the training and testing phases. Each of the convolutional blocks consists of a convolution layer, a rectified linear unit activation function, batch normalization, and max pooling. The number of convolutional filters at each block is denoted in Figure 1. We used gated recurrent units (GRUs) [chung2014gru] to implement the RNN module and initialized the hidden states with zeros.
All neural networks are trained with a batch size of 23 using stochastic gradient descent with Nesterov momentum of 0.9. The initial learning rate is set to 0.1 and decayed by a factor of 5 when the validation loss does not decrease for 3 epochs. Training is stopped when the validation loss reaches the third plateau. We inserted dropout with a ratio of 0.5 in the convolutional modules. We use PyTorch [paszke2017pytorch] to build and train the models. The source code is available at https://github.com/tae-jun/temporal-feedback-crnn.
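The learning-rate schedule and stopping rule can be sketched as a small state machine. This is one possible reading of the description, not the released training code: the class name `PlateauSchedule` is hypothetical, and it is assumed that the first two plateaus trigger a decay while the third one ends training.

```python
class PlateauSchedule:
    """Decay the learning rate on validation plateaus; stop at the last one."""

    def __init__(self, lr=0.1, factor=5.0, patience=3, max_plateaus=3):
        self.lr = lr
        self.factor = factor            # learning rate is divided by this
        self.patience = patience        # epochs without improvement
        self.max_plateaus = max_plateaus
        self.best = float("inf")
        self.bad_epochs = 0
        self.plateaus = 0

    def step(self, val_loss):
        """Update after one epoch; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        if self.bad_epochs < self.patience:
            return False
        # A plateau: the loss has not decreased for `patience` epochs.
        self.plateaus += 1
        self.bad_epochs = 0
        if self.plateaus >= self.max_plateaus:
            return True
        self.lr /= self.factor
        return False


schedule = PlateauSchedule()
for epoch, val_loss in enumerate([1.0, 0.9, 0.9, 0.9, 0.9]):
    if schedule.step(val_loss):
        break
print(schedule.lr)
```

After the three non-improving epochs in this toy sequence, the learning rate has been divided by 5 once.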
|Model|Input/output setup|Accuracy|
|SampleCNN [kim2019comparison]|one-to-one|0.9497 (0.0011)|
|AttentionCRNN [de2018neural]|many-to-one|0.9390 (–)|
Table 1: Performances of models on keyword spotting. The scores are averages of 3 runs. Standard deviations are denoted in parentheses.
4 Results and Discussion
4.1 Ablation study
We conducted a grid search over the size of the time step for each setting to find the optimal size. Note that, since we maintain 50% overlap between the receptive fields of the CNN module at adjacent time steps, the size of the time step determines the size of the receptive field of the CNN module and, in turn, the depth of the CNN module, that is, the number of convolutional blocks. The size of the time step also determines the period of temporal feedback in TF-CRNNs.
Figure 3 shows the accuracy scores of the models on keyword spotting. The CRNN is the TF-CRNN with the temporal feedback connections removed. The general trend shows that the TF-CRNNs outperform the CRNNs. Regarding the input/output setup, many-to-many outperforms many-to-one at every time step size for the same model. The best time step size varies depending on the model settings: 50ms is the best for the TF-CRNNs, whereas 100ms is the best for the CRNN with the many-to-many setup and 250ms for the CRNN with the many-to-one setup. These results indicate that the temporal feedback connections consistently improve the performance and that the many-to-many setup is more effective in the keyword spotting task.
Table 1 compares the results of the CRNN models to those of other models in previous studies. It shows that the CRNN in our experiment is superior to both SampleCNN with the one-to-one setup [kim2019comparison] and AttentionCRNN with the many-to-one setup and mel-spectrogram input [de2018neural], and that TF-CRNN achieves the best performance.
4.2 Failure analysis
We investigate the effects of the temporal feedback further by conducting a failure analysis of misclassified keywords. The top of Figure 4 shows the F1-score differences between CRNNs and TF-CRNNs for each keyword. A positive value indicates that TF-CRNNs outperform CRNNs for that keyword. We can observe that the performance differences are prominent for “learn”, “forward”, “nine”, “backward”, “cat”, and so on. To analyze the errors and differences in detail, we visualize the confusion matrices of the two models at the bottom of Figure 4. We selected the top-3 keywords (“forward”, “learn”, and “nine”) and the keywords they are confused with (“follow” and “left”). We normalized the matrices by the total number of examples of each true label, so that the diagonal values are the recall scores for each keyword. The models tend to misclassify groups of words with similar pronunciations, such as “learn” as “left” or “nine”, and “follow” as “forward”. However, these errors decrease by half in TF-CRNNs.
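The normalization of the confusion matrices can be illustrated with a short sketch. The function name and the counts below are hypothetical and are not taken from the experiments; rows are assumed to index the true labels and columns the predictions.

```python
def normalize_rows(confusion):
    """Divide each row (one true label) by its total number of examples.

    After normalization, the diagonal entry of each row is the recall
    of that label.
    """
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Hypothetical counts for two keywords (rows: true label, columns: prediction).
confusion = [[8, 2],
             [1, 9]]
recalls = [row[i] for i, row in enumerate(normalize_rows(confusion))]
print(recalls)
```

Reading the diagonal as recall makes the two models directly comparable even when keywords have different numbers of test examples.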
5 Analysis of Temporal Feedbacks
The role of temporal feedback in TF-CRNN is to adjust the strength of the channel-wise activations using accumulated information from the upper level and, by doing so, to increase the representational power of the model. To better understand the behavior of TF-CRNN, we computed statistics of the excitations over time from the temporal feedback in keyword spotting. We performed the analysis on a trained TF-CRNN with the many-to-many setup using the keyword spotting test set. Figure 5 shows the summarized temporal excitations of four different keywords (“backward”, “bird”, “stop”, and “up”) from the first to the last block. Comparing them to the root mean square (RMS) energy curves of the input waveforms, we can observe that the excitations in the first block have opposite trends to the energy curves. When the energy level increases, the excitations become smaller and in turn suppress the sensitivity of the feature activations. When the energy level decreases, the excitations amplify the sensitivity. This behavior is analogous to the operation of the outer hair cells [Lyon:2017]. SENets also show similar patterns in the excitations [kim2019comparison], but the difference is that the temporal feedback in TF-CRNN comes from the upper layer and is used to normalize the input signals at the next time step. In higher layers, the amplitude of the excitations becomes attenuated. In the last layer, however, the trends are flipped (closer to the energy curves) and become more discriminative for the four classes. This is also similar to the trend in SENets [hu2018senet, kim2019comparison].
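An RMS energy curve of the kind used for this comparison can be computed frame by frame. The sketch below is an illustration only: the function name, the frame size, and the hop size are assumptions, since the analysis settings are not specified here.

```python
import math


def rms_curve(waveform, frame_size, hop_size):
    """Return the root mean square energy of each frame of a waveform."""
    curve = []
    for start in range(0, len(waveform) - frame_size + 1, hop_size):
        frame = waveform[start:start + frame_size]
        curve.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return curve


# Hypothetical signal: a quiet part followed by a louder part.
waveform = [0.1, -0.1] * 4 + [0.5, -0.5] * 4
curve = rms_curve(waveform, frame_size=4, hop_size=4)
print(curve)
```

Plotting such a curve next to the first-block excitations at matching time steps is one way to check the inverse relationship described above: where the curve rises, the excitations would be expected to fall.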
We proposed TF-CRNN, a novel neural network architecture inspired by the efferent connections in the human brain. TF-CRNN performs channel-wise scaling in the convolutional blocks by taking temporal feedback from the RNN module, thereby controlling the sensitivity of the channel-wise feature activations. We evaluated our models on the keyword spotting task. The experiments demonstrate that the proposed model outperforms all compared models. We also conducted a failure analysis to examine the improvement in detail and visualized the excitations to better understand the behavior of the temporal feedback. In this paper, we showed the potential of TF-CRNN in audio classification tasks. For future work, we plan to evaluate the proposed model on other tasks such as speaker verification, music auto-tagging, acoustic event detection, and automatic music transcription.