The sound event detection (SED) task involves labeling the time stamps of sound events in audio streams and detecting their types. Speech and non-speech sounds such as laughter and music contain a great deal of useful information. Being able to detect environmental sound events in multi-channel audio can greatly help us understand the surrounding acoustic environment and enables many applications such as audio surveillance and rare sound detection [kotus2014detection, foggia2016audio, crocco2016audio]. Knowing the types of sounds present can also improve the performance of speech enhancement and separation systems [stowell2015detection, kong2018joint]. Robotic systems can employ SED for navigation and natural interaction with surrounding acoustic environments [takeda2016sound, he2018deep]. Smart home devices can benefit from it for environmental sound understanding [southern2017sounding, kao2018r]. Driven by these applications, as well as multimedia content retrieval [xu2008audio, jin2012event] and audio segmentation [kumar2016audio, tian2015use, wichern2010segmentation], sound event detection is attracting increasing attention.
Real-life audio recordings typically contain many overlapping sound events; the task of recognizing all of the overlapping sounds is known as polyphonic SED. Many approaches address this task by predicting frame-wise labels for each sound event class. Models such as the Gaussian Mixture Model (GMM) [atrey2006audio], Hidden Markov Model (HMM) [mesaros2010acoustic], Recurrent Neural Networks (RNN) [hayashi2017duration, parascandolo2016recurrent], and Convolutional Neural Networks (CNN) [hershey2017cnn, zhang2015robust] have been explored extensively. More recently, successful results were obtained by stacking CNN, RNN, and FC layers consecutively, referred to jointly as the convolutional recurrent neural network (CRNN) [cakir2017convolutional]. To improve the recognition of overlapping sound events, several multi-channel SED methods have also been proposed. For example, Adavanne et al. [adavanne2017sound, adavanne2018sound, adavanne2018multichannel] showed that deep neural networks can learn directly from low-level features such as generalized cross-correlation with phase-based weighting (GCC-PHAT) for multi-channel sound event detection.
The basic building block of most SED models is the convolutional layer, which learns filters capturing local spatial patterns along all input channels and generates feature maps that jointly encode time-frequency and channel information. Much work has aimed at improving this joint encoding of spatial and channel information [dai2017deformable], but far less attention has been given to encoding time-frequency and channel-wise patterns independently with domain information. Some recent work addresses this issue by explicitly modeling the inter-dependencies between the channels of feature maps. One promising approach is a component called the "Squeeze & Excitation" (SE) block [hu2018squeeze, roy2018concurrent], which can be seamlessly integrated into a CNN. The SE block factors out the spatial dependency by global average pooling to learn a channel-specific descriptor, which is then used to rescale the input feature map so that only useful channels are highlighted.
In this study, we introduce a time-frequency-channel Squeeze and Excitation (tfc-SE) block for multi-channel sound event detection. Unlike the original SE block mentioned above, which uses global average pooling and excites only along the channel axis, we adapt and extend it to both the time-frequency and channel domains. Multi-channel audio contains different information at each time-frequency location; for example, we may want to pay more attention to high-energy regions of the spectrogram. We first introduce a time-frequency SE (tf-SE) block that adaptively recalibrates the learned feature map: it does not change the receptive field, but provides time-frequency attention over certain regions. We then propose two methods to combine the channel-wise SE and the time-frequency SE. These methods aggregate the unique properties of each block and make the feature maps more informative in both domains. We show that with the best tfc-SE block, the error rate of the SED system decreases from 0.2538 to 0.2026, a relative reduction of 20.17%, along with a 5.72% relative improvement in F1 score over the original CRNN.
2 Sound Event Detection Systems
2.1 Baseline Convolutional RNN
We use the recently proposed Convolutional Recurrent Neural Network (CRNN) [cakir2017convolutional] to learn acoustic representations of multi-channel audio signals for sound event detection. The CRNN model has three components: convolutional layers that learn time-frequency representations of the audio waveform, recurrent layers that learn temporal information, and a final classification layer for sound event detection. The model configuration is given in Table 1. We use three 2D CNN layers to learn shift-invariant features from the multi-channel spectrograms, with the filter shapes listed in Table 1; the time resolution of the input features is kept unchanged.
Table 1: Model configuration (layer type, filter shape, and input shape).
The output activations from the CNN are fed to bidirectional RNN layers, which are designed to learn temporal context information from the CNN output activations. Specifically, each layer uses Gated Recurrent Units (GRU) with tanh activations. For classification, we use two fully connected (FC) layers: the first FC layer has nodes with linear activation, and the last FC layer consists of nodes with sigmoid activations, each corresponding to one of the sound event classes to be detected.
In the following sections, we introduce the "Squeeze & Excitation" (SE) blocks. We insert an SE block after each convolutional layer to adaptively recalibrate the feature representations.
2.2 Channel-wise squeeze and excitation
2.2.1 Squeeze operation
The channel-wise SE block, c-SE, is illustrated in Fig. 1 (a). In order to model the inter-dependencies between multiple channels, we first define a squeeze operation that embeds the global time-frequency information into a channel descriptor. We write the input feature map of the multi-channel audio as $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_C]$, where $\mathbf{u}_c \in \mathbb{R}^{T \times F}$ is the feature matrix of channel $c$. We use a global average pooling layer to generate a channel-wise vector $\mathbf{z} \in \mathbb{R}^C$ with its $c$-th element

$$z_c = \frac{1}{T \cdot F} \sum_{t=1}^{T} \sum_{f=1}^{F} u_c(t, f).$$

This operation embeds the global time-frequency information into the vector $\mathbf{z}$, whose statistics are expressive of the whole time-frequency input.
2.2.2 Excitation operation
In order to capture channel-wise dependencies, a gating mechanism with a sigmoid activation function is used to learn a non-linear relationship between channels,

$$\mathbf{s} = \sigma\!\left(\mathbf{W}_2 \, \delta(\mathbf{W}_1 \mathbf{z})\right),$$

where $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the weights of two fully-connected layers and $\delta$ is the ReLU operator. The dimension reduction factor $r$ sets the bottleneck width of the channel excitation block; note that the original channel dimension is recovered by the second FC layer. With the sigmoid layer $\sigma$, the channel-wise attention vector $\mathbf{s} \in [0, 1]^C$ is obtained. Finally we recalibrate $\mathbf{U}$ with the attention vector as

$$\tilde{\mathbf{u}}_c = s_c \, \mathbf{u}_c, \quad c = 1, \ldots, C.$$

$\tilde{\mathbf{U}} = [\tilde{\mathbf{u}}_1, \ldots, \tilde{\mathbf{u}}_C]$ is the final channel-wise recalibrated feature map. In this block, the input features are attentively scaled so that important channels are emphasized and less important ones are diminished.
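The squeeze and excitation steps above can be sketched in NumPy. This is an illustrative toy, not the trained model: the array shapes, reduction ratio, and random weights are placeholders.

```python
import numpy as np

def c_se(U, W1, W2):
    """Channel-wise squeeze-and-excitation on a (C, T, F) feature map.

    Squeeze: global average pooling over the time-frequency plane.
    Excitation: FC -> ReLU -> FC -> sigmoid, then rescale each channel.
    """
    z = U.mean(axis=(1, 2))                  # squeeze: (C,)
    h = np.maximum(0.0, W1 @ z)              # bottleneck + ReLU: (C/r,)
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))      # channel attention in (0, 1): (C,)
    return s[:, None, None] * U              # recalibrate every channel

# toy example with C=4 channels and reduction ratio r=2
rng = np.random.default_rng(0)
U = rng.standard_normal((4, 8, 16))
W1 = rng.standard_normal((2, 4)) * 0.1       # shape (C/r, C)
W2 = rng.standard_normal((4, 2)) * 0.1       # shape (C, C/r)
U_tilde = c_se(U, W1, W2)
```

Each channel of `U_tilde` is the original channel multiplied by one scalar attention weight, which is exactly the recalibration in the equation above.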
2.3 Time-frequency squeeze and excitation
Here we introduce the time-frequency-wise squeeze and excitation block (tf-SE), shown in Fig. 1 (b). The concept is similar to the c-SE; the main difference is that the feature map is compressed along the channel axis using a 1-by-1 convolution [lin2013network], and the excitation operation is performed on the time-frequency map. We assume that the time-frequency space may contain more information for SED.
We write the input feature map in an alternate form, $\mathbf{U} = \{u(1,1), \ldots, u(t,f), \ldots, u(T,F)\}$, where $u(t,f) \in \mathbb{R}^C$ is the feature bin at the time-frequency location $(t, f)$. The squeeze operation of the tf-SE is done using a 1-by-1 convolution, which can be represented as a linear combination of all channels at a location $(t, f)$,

$$q(t,f) = \mathbf{w}^{\top} u(t,f),$$

where $\mathbf{w} \in \mathbb{R}^C$ holds the filter coefficients and $q(t,f)$ is the $(t,f)$-th element of the squeezed matrix $\mathbf{Q} \in \mathbb{R}^{T \times F}$. We use a sigmoid function $\sigma$ to limit the range of the matrix to $[0, 1]$, which is then used to recalibrate the features in the time-frequency domain. Each value $\sigma(q(t,f))$ corresponds to the relative importance of that location in the time-frequency space.
Similar to the c-SE, the excitation of the tf-SE is carried out as

$$\tilde{u}(t,f) = \sigma(q(t,f)) \, u(t,f),$$

where $\tilde{u}(t,f)$ is the $(t,f)$-th element of the recalibrated output $\tilde{\mathbf{U}}$.
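The tf-SE recalibration can likewise be sketched in NumPy; as before, the shapes and random filter coefficients are toy placeholders.

```python
import numpy as np

def tf_se(U, w):
    """Time-frequency squeeze-and-excitation on a (C, T, F) feature map.

    Squeeze: a 1x1 convolution, i.e. a linear combination over channels,
    produces a (T, F) map Q. Excitation: sigmoid(Q) rescales every channel
    at each time-frequency location.
    """
    Q = np.tensordot(w, U, axes=(0, 0))      # squeeze: (T, F)
    A = 1.0 / (1.0 + np.exp(-Q))             # attention map in (0, 1)
    return A[None, :, :] * U                 # same scale for all channels at (t, f)

rng = np.random.default_rng(1)
U = rng.standard_normal((4, 8, 16))
w = rng.standard_normal(4) * 0.1             # 1x1 conv filter coefficients
U_tilde = tf_se(U, w)
```

In contrast to the c-SE, the scaling here varies over $(t, f)$ but is shared across channels.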
2.4 Concurrent and sequential time-frequency-channel SE
Each of the c-SE and tf-SE blocks explained above has its own unique properties. The c-SE block recalibrates the channel information by incorporating global time-frequency information. The tf-SE block, on the other hand, generates a time-frequency attention map indicating where the network should focus to aid sound classification.
We propose two ways to combine the complementary information from these two SE blocks to form the time-frequency-channel SE (tfc-SE): 1) Concurrent recalibration, illustrated in Fig. 2 (a) and 2) Sequential recalibration in order of channel first and then time-frequency as in Fig. 2 (b).
For the concurrent tfc-SE, we present four ways to aggregate the c-SE and the tf-SE blocks.
Addition: the two recalibrated feature maps are added element wise with equal weights.
Multiplication: the feature maps are multiplied element wise.
Maximization: each location of the output feature map has the maximum activation of the two feature maps element-wise.
Concatenation: the two feature maps are concatenated along the channel axis, which doubles the number of channels of the feature map.
The concurrent tfc-SE and sequential tfc-SE with these aggregation strategies will be evaluated in the following sections.
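The four aggregation strategies can be sketched as follows (a NumPy toy assuming both recalibrated maps share the shape (C, T, F); the shapes are placeholders):

```python
import numpy as np

def aggregate(u_c, u_tf, how):
    """Combine the c-SE and tf-SE recalibrated maps, both shaped (C, T, F)."""
    if how == "add":        # element-wise sum with equal weights
        return u_c + u_tf
    if how == "mul":        # element-wise product
        return u_c * u_tf
    if how == "max":        # element-wise maximum of the two maps
        return np.maximum(u_c, u_tf)
    if how == "concat":     # stack along the channel axis: 2C channels out
        return np.concatenate([u_c, u_tf], axis=0)
    raise ValueError(f"unknown aggregation: {how}")

rng = np.random.default_rng(2)
u_c = rng.standard_normal((4, 8, 16))
u_tf = rng.standard_normal((4, 8, 16))
```

Note that only concatenation changes the output shape, which is why it increases the parameter count of the following layer.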
3 Experimental Setup
3.1 Dataset
To study the effectiveness of the CRNN model with the tfc-SE module in multi-channel sound event detection, we used the synthetic eight-channel TUT Sound Events 2018 - Circular array, Reverberant and Synthetic Impulse Response (CRESIM) dataset [adavanne2018sound]. The dataset is synthesized from the DCASE 2016 task 2 dataset [dcase2016], which has 11 isolated sound event classes such as speech, door slam, phone ringing, coughing, and keyboard.
Each sound class has 20 examples, of which 16 are randomly chosen for the training set and the remaining 4 for the test set, giving 176 examples from 11 classes for training and 44 for testing. We selected the O3 subset, which has a maximum of three temporally overlapping sources. It consists of three cross-validation splits with 240 training and 60 testing recordings of length 30 sec sampled at 44100 Hz. The circular microphone array recording is simulated with 8 omnidirectional microphones equally spaced on a circle of 5 cm radius. More details can be found in [adavanne2018sound].
3.2 Feature extraction
On a frame basis, we compute the magnitude and phase of the spectrograms and concatenate them along the channel axis as input features. A Hamming window with 50% overlap is used to extract the spectrogram from each audio channel. The zeroth bin is excluded from the spectrogram, so that each frame produces a feature matrix. The input features are mean and variance normalized at the frame level. A sequence of spectrogram frames is stacked and fed into the network.
3.3 Evaluation metrics
We evaluated our sound event detection model using the standard polyphonic SED metrics, error rate (ER) and F-score, calculated on non-overlapping segments of one second as proposed in [mesaros2016metrics]. F1 (ideally F1 = 1) is based on true and false positives, and the error rate (ER) (ideally ER = 0) is based on the total number of active sound event classes in the ground truth. A joint SED error score can also be formed by combining ER and F1.
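The segment-based scoring can be sketched as below. This is an illustrative simplification following the usual decomposition of segment errors into substitutions, deletions, and insertions, not the reference sed_eval implementation:

```python
import numpy as np

def segment_metrics(ref, est):
    """Segment-based F1 and ER from binary (segments, classes) matrices.

    Per segment: substitutions S = min(FN, FP), deletions D = FN - S,
    insertions I = FP - S; ER normalizes the total errors by the number
    of active reference events N.
    """
    tp = np.sum(ref * est)
    fp_seg = np.sum(np.maximum(est - ref, 0), axis=1)   # false positives per segment
    fn_seg = np.sum(np.maximum(ref - est, 0), axis=1)   # false negatives per segment
    s = np.minimum(fn_seg, fp_seg)
    d, i = fn_seg - s, fp_seg - s
    f1 = 2 * tp / (2 * tp + fp_seg.sum() + fn_seg.sum())
    er = (s.sum() + d.sum() + i.sum()) / ref.sum()
    return f1, er

# toy ground truth and predictions: 2 one-second segments, 3 classes
ref = np.array([[1, 0, 1], [0, 1, 0]])
est = np.array([[1, 1, 0], [0, 1, 0]])
f1, er = segment_metrics(ref, est)
```

In the toy example, segment 1 has one false positive and one false negative, which count as a single substitution.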
3.4 Model configuration and training
We explored various CRNN model and feature configurations. Preliminary experiments showed that the best input feature setting uses a sequence length equivalent to 1.486 sec. The CRNN model was trained on the CRESIM dataset for 1000 epochs with a batch size of 64. Early stopping was applied if there was no improvement in the score after 100 epochs. We built various models with different parameters, such as the number of CNN/RNN/FC nodes and the CNN filter size; the best baseline setting can be found in Table 1. The last fully connected layer has 11 nodes with sigmoid activation, and the cross-entropy loss was employed for network training. We used an Adam optimizer. For each sound event class, all predicted probabilities in the sequence were examined at the frame level, and the class was claimed as detected if any probability exceeds a threshold of 0.5.
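The frame-level detection decision can be sketched as a small NumPy snippet (the probability values are made-up placeholders):

```python
import numpy as np

THRESHOLD = 0.5

def detect_classes(probs, threshold=THRESHOLD):
    """Claim a class as detected if any frame probability exceeds threshold.

    probs: (frames, classes) array of sigmoid outputs from the last FC layer.
    Returns a boolean vector with one entry per sound event class.
    """
    return (probs > threshold).any(axis=0)

# two frames, three classes: class 0 never exceeds 0.5; classes 1 and 2 do
probs = np.array([[0.1, 0.7, 0.4],
                  [0.2, 0.3, 0.6]])
detections = detect_classes(probs)
```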
4 Results and Discussions
4.1 Experimental results
In order to thoroughly evaluate our proposed methods, we will conduct detailed ablation analysis in this section. We first perform experiments on the c-SE, followed by the tf-SE, and then the sequential tfc-SE for sound event detection. We further investigate the aggregation strategies for the concurrent tfc-SE. Finally we analyze various factors that may affect the performance of SE blocks.
From Table 2, we observe that our channel-wise SE (c-SE) block improves both F1 and ER compared with the original CRNN model. With the c-SE block, the overall error score of the model decreases from 0.2285 to 0.2194, a relative gain of 3.98%; this approach also achieves a relative 6.30% improvement in ER. We also find a consistent performance improvement with the time-frequency SE (tf-SE) block: relative improvements of 4.08% in F1 and 12.02% in ER compared with the original CRNN model. These results indicate that the tf-SE is more effective than the c-SE, which aligns with our assumption that the time-frequency space may carry more meaningful information than the channel dimension for the SED task.
Finally, we test the tfc-SE block, which combines the c-SE and tf-SE activations. Both the concurrent and sequential models improve on the original CRNN model by a large margin. The best result is obtained with the sequential tfc-SE block, which achieves an 84.23% F1 score and a 0.2026 error rate, relative improvements of 5.72% and 20.17% over the original CRNN. In terms of the overall score, the sequential tfc-SE block outperforms the original CRNN model by 21.18% relative.
4.2 Aggregation strategies
We further investigate the aggregation strategy of the concurrent tfc-SE block among the four choices. We observe from Table 3 that all aggregation methods improve SED performance over the original CRNN, with maximization providing the best result. Since concatenation doubles the number of channels and thus increases model complexity, the maximization operator is also the better choice for a lower computational cost. For all other experiments, we use maximization-based aggregation for tfc-SE blocks.
4.3 Sensitivity to dimension reduction ratio
The reduction ratio $r$ introduced in Section 2.2.2 is an important hyperparameter that allows us to vary the capacity and computational cost of the c-SE blocks in the model. Table 4 reveals that the performance does not improve monotonically with increased capacity, which likely stems from overfitting with a larger model. We choose the best-performing value of $r$ from Table 4 and use it for our other experiments.
4.4 Squeeze and excitation operator
For the squeeze operator, we examine the significance of using global average pooling as opposed to global max pooling (while keeping the sigmoid excitation operator) in Table 5. Though both max and average pooling are effective, average pooling achieves slightly better results, and we use it throughout our paper. We next assess the excitation operator: two alternatives, ReLU and tanh, are tested in place of the sigmoid (keeping global average pooling as the squeeze operator). The results suggest that the sigmoid operator is important for making the tfc-SE block effective.
4.5 Model complexity
The original CRNN model has 496,587 parameters. The tfc-SE blocks require only 0.7% additional parameters, for a total of 500,067.
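The overhead can be checked directly from the two parameter counts reported above:

```python
# Parameter overhead of adding tfc-SE blocks, from the reported model sizes.
base_params = 496_587        # original CRNN
tfc_params = 500_067         # CRNN with tfc-SE blocks inserted

extra = tfc_params - base_params       # additional parameters
overhead = extra / base_params         # fractional overhead (~0.7%)
print(f"{extra} extra parameters ({overhead:.2%} overhead)")
```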
5 Conclusions
In this paper, we proposed a time-frequency-channel squeeze and excitation block for multi-channel sound event detection, designed to model the feature inter-dependencies across channels and time-frequency locations. The tfc-SE block was inserted after each convolutional layer of the CRNN model. The proposed method was evaluated on the CRESIM dataset and shown to improve on the original CRNN model in terms of both F1 score and error rate by a large margin. These results indicate that the tfc-SE block effectively recalibrates feature maps by emphasizing the more important channels and time-frequency locations.