SAM-GCNN: A Gated Convolutional Neural Network with Segment-Level Attention Mechanism for Home Activity Monitoring

10/03/2018 ∙ by Yu-Han Shen, et al. ∙ Tsinghua University ∙ NetEase, Inc

In this paper, we propose a method for home activity monitoring. We demonstrate our model on the dataset of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Challenge Task 5. This task aims to classify multi-channel audio clips into one of the provided pre-defined classes, all of which are daily activities performed in a home environment. To tackle this task, we propose a gated convolutional neural network with segment-level attention mechanism (SAM-GCNN). The proposed framework is a convolutional model with two auxiliary modules: a gated convolutional neural network and a segment-level attention mechanism. Furthermore, we adopt model ensemble to enhance the generalization capability of our model. We evaluated our work on the development dataset of DCASE 2018 Task 5 and achieved competitive performance, with the macro-averaged F1-score increasing from 83.76% for the convolutional baseline system to 89.33%.




I Introduction

Recently, sound event detection and classification have become increasingly popular in the field of acoustic signal processing, with applications in security surveillance, wildlife protection and smart homes. One important application of sound event classification in smart homes is home activity monitoring.

Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge is one of the most important international challenges concerning acoustic event detection and classification and has been organized for several years. DCASE 2018 challenge consists of five tasks and we focus on task 5[1]. This task evaluates systems for monitoring of domestic activities based on multi-channel acoustics.

We can also refer to this task as acoustic activity classification. The main procedure of acoustic activity classification consists of four parts: pre-processing, extracting acoustic features, designing acoustic models as classifiers, and post-processing.

In the part of pre-processing, different methods of data augmentation have been utilized in [2][3]. Data imbalance is a big challenge in acoustic event classification and detection because different events may occur at a completely imbalanced frequency. In DCASE 2018 Challenge Task 5, Inoue et al. used shuffling and mixing to produce more training samples[2], and Tanabe et al. utilized dereverberation, blind source separation and data augmentation to improve the quality of audio clips[3].

Mel Frequency Cepstral Coefficients (MFCC) are a common traditional acoustic feature and have been widely used. However, log Mel-scale Filter Bank energies (fbank) have recently become more popular, and many works are based on fbank[1][4][5].

In recent years, Convolutional Neural Networks (CNNs) have achieved great success in many fields such as character recognition, image classification and speaker recognition, and many works based on CNNs have been done in acoustic event classification and detection[6][7]. Besides, some researchers combined CNNs with Recurrent Neural Networks (RNNs) to capture temporal contexts of audio signals for further improvements.


Attention models have been widely used in image classification, object detection and natural language understanding. In the field of acoustic signal processing, Xu et al. [8] proposed an attention model for weakly supervised audio tagging and Kong et al. [9] improved this work by giving a probabilistic perspective. Their work is based on the assumption that irrelevant sound frames, such as background noise and silences, should be ignored and given less attention. Both models are implemented as a weighted sum over frames, where the attention values are learned automatically by a neural network.

In our work, acoustic activities may last for a longer period, and a single frame is not enough to decide whether it should be ignored. In an audio recording, an acoustic activity may persist across a majority of frames, whereas an acoustic event occurs in only a few frames. We therefore propose a segment-level attention mechanism (SAM) to decide how much attention should be given based on the characteristics of segments. Here, a segment comprises several frames.

In this paper, we mainly adopt three ways to improve the performance of our model:

(1) We replace currently popular CNN with gated convolutional neural network to extract more temporal features of audios;

(2) We propose a new segment-level attention mechanism to focus more on the audio segments with more energy;

(3) We utilize model ensemble to enhance the classification capability of our model.

The rest of this paper is organized as follows. In Section 2, we introduce our methods in detail, mainly including acoustic feature, gated convolutional neural network, segment-level attention mechanism and model ensemble. The experiment setup, evaluation metric and our results are illustrated in Section 3. Finally, the conclusion of our work is presented in Section 4.

II Methods

II-A Task Description

The DCASE 2018 Task 5 dataset[10] contains sound data recorded in a living room by seven microphone arrays at undisclosed locations, each array providing four channels. The dataset is divided into a development dataset and an evaluation dataset. Four cross-validation folds are provided for the development dataset so that results reported on this dataset are comparable. For each fold, a training, testing and evaluation subset is provided. In this paper, our work is based on the development dataset and we use the provided cross-validation folds for training and evaluation.

The audio clips in this dataset can be classified into nine classes: absence, cooking, dishwashing, eating, other, social activity, vacuum cleaning, watching TV and working. All audio clips are derived from continuous recording sessions collected by seven microphone arrays and each clip contains four channels. The duration of each audio clip is 10 seconds. Specific information about the dataset is shown in Table 1 and more details can be found in [10].

Activity          #10s clips   #sessions
Absence           18860        42
Cooking           5124         13
Dishwashing       1424         10
Eating            2308         13
Other             2060         118
Social activity   4944         21
Vacuum cleaning   972          9
Watching TV       18648        9
Working           18644        33
Total             72984        268
TABLE I: Amounts of audio clips and sessions

II-B System Overview

Our proposed system is illustrated in Figure 1. The input of our system is log Mel-scale filter bank energies (fbank), which are fed into two structures: one is a Gated Convolutional Neural Network (GCNN) architecture, and the other is our proposed Segment-Level Attention Mechanism (SAM).

Unlike most systems, which output one probability score for an audio clip as a whole, we divide a 10-s audio clip into several segments. The output of our GCNN architecture is a matrix X that represents the probability for each class of each segment, with one row per segment in an audio clip and one column per predefined class. The output of SAM is a vector W that represents the attention weight factor for each segment. Then we multiply X with W for each segment to obtain weighted segment scores. Those scores are averaged over segments to get a vector, which then goes through a softmax to represent the normalized probability for each class. The class with the largest probability is taken as the classification result.
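
The pooling step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `classify_clip` and the toy numbers are introduced here for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_clip(segment_scores, attention_weights):
    """Combine per-segment class scores (X) with per-segment weights (W).

    segment_scores: one list of per-class scores for each segment.
    attention_weights: one weight in [0, 1] for each segment.
    Returns (predicted_class_index, class_probabilities).
    """
    if len(segment_scores) != len(attention_weights):
        raise ValueError("X and W must cover the same number of segments")
    num_segments = len(segment_scores)
    num_classes = len(segment_scores[0])
    # Weight every segment's scores, then average over the segments.
    pooled = [
        sum(w * seg[c] for seg, w in zip(segment_scores, attention_weights))
        / num_segments
        for c in range(num_classes)
    ]
    probs = softmax(pooled)
    predicted = max(range(num_classes), key=probs.__getitem__)
    return predicted, probs

# Toy clip (hypothetical values): 3 segments, 2 classes.
# The middle segment votes strongly for class 1 but receives little attention.
X = [[2.0, 0.0], [0.0, 4.0], [1.0, 0.0]]
W = [0.9, 0.1, 0.8]
label, probs = classify_clip(X, W)
uniform_label, _ = classify_clip(X, [1.0, 1.0, 1.0])
```

In this toy case the attention weights change the decision: with uniform weights the strongly scored middle segment dominates, whereas down-weighting it lets the other two segments decide.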

Fig. 1: Overall architecture of proposed system

The detailed explanations of our proposed system will be included in the following parts of this section.

II-C Acoustic Feature

We use fbank as the input of our system. Fbank is a two-dimensional time-frequency acoustic feature. It imitates the characteristics of the human ear and concentrates more on the low-frequency components of audio signals. Compared with the traditional MFCC feature, fbank keeps more of the original information, and it has been widely used in deep learning. To extract the fbank feature, each input audio clip is divided into 40 ms frames with 50% overlap, and then 40 mel-scale filters are applied to the magnitude spectrum of each frame. Finally, we take the logarithm of the amplitude to get the fbank feature. As mentioned above, the audio clips contain four channels, so our fbank feature contains four channels as well. In our work, the four channels are fed into the system separately during training, and the averaged output score of the four channels is used for evaluation.
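
The framing parameters above determine the time dimension of the feature. The short calculation below is a sketch under assumptions: the 16 kHz sample rate and the two padding conventions are not stated in the text and are chosen here only for illustration. It suggests that the 501 frames listed in the output sizes of Table 3 are consistent with a framing scheme that pads the clip by half a frame on each side.

```python
def frame_count(num_samples, frame_len, hop_len, centered):
    """Number of analysis frames for a signal of num_samples samples."""
    if centered:
        # Signal padded by frame_len // 2 on both sides (a common STFT convention).
        return 1 + num_samples // hop_len
    # No padding: only frames that fit entirely inside the signal.
    return 1 + (num_samples - frame_len) // hop_len

SAMPLE_RATE = 16000                        # assumed; not given in the text
CLIP_SECONDS = 10
FRAME_LEN = 40 * SAMPLE_RATE // 1000       # 40 ms frame
HOP_LEN = FRAME_LEN // 2                   # 50% overlap

num_samples = CLIP_SECONDS * SAMPLE_RATE
frames_centered = frame_count(num_samples, FRAME_LEN, HOP_LEN, centered=True)
frames_unpadded = frame_count(num_samples, FRAME_LEN, HOP_LEN, centered=False)
print(frames_centered, frames_unpadded)
```

With 40 mel filters, the per-channel feature would then have 40 rows (frequency) and one column per frame.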

Fig. 2: Overall architecture of gated convolutional neural network

II-D Gated Convolutional Neural Network

The gated convolutional neural network was proposed by Dauphin et al. in [11] and has shown great power in machine translation and natural language processing. Our GCNN architecture consists of three main parts: 1) convolutional neural network (CNN), 2) gated convolutional neural network (GCNN), 3) feedforward neural network (FNN). The overall architecture is shown in Figure 2.

Fig. 3: Gated convolutional neural network.

Before being fed into the GCNN architecture, the extracted fbank feature is normalized to zero mean and unit standard deviation (we call this global normalization, to distinguish it from the time normalization described later).

Convolutional layers extract frequency features and connect features of adjacent frames. The output of the convolutional layer is followed by batch normalization, a ReLU activation unit and a dropout layer. Then a max-pooling layer is applied to keep the most important features.

The structure of gated convolutional neural network is illustrated in Figure 3.

In the gated convolutional neural network, the output of the convolutional layer is divided into two parts of the same size. The input of this structure, $E = [e_1, e_2, \dots, e_n]$, passes through a convolutional layer and the output is divided into $A$ and $B$. Then $B$ passes through a sigmoid activation function and is multiplied with $A$ element-wise. To strengthen the network, we add a residual connection from the input $E$ to the output $H$ of this structure. Residual networks were introduced to avoid the vanishing gradient problem.

The specific formula is as follows:

$$H = E + A \otimes \sigma(B), \qquad A = E * W + b, \qquad B = E * V + c$$

where $W$, $V$ represent convolutional kernel values, and $b$, $c$ are biases. $\otimes$ represents the element-wise product. $\sigma$ is the sigmoid activation function.
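
The gated unit with its residual connection can be illustrated with a small dependency-free sketch. It is not the authors' code: the helper names `conv1d_same` and `gated_conv_block`, the layout `kernel[k][c_in][c_out]`, and the choice to store the linear half and the gate half in one kernel with twice the channels are assumptions made here for illustration.

```python
import math

def conv1d_same(x, kernel, bias):
    """1-D convolution over time (cross-correlation form) with zero 'same' padding.

    x: list of T frames, each a list of C_in values.
    kernel: kernel[k][c_in][c_out], with odd width K.
    bias: list of C_out values.
    Returns T frames of C_out values.
    """
    width = len(kernel)
    half = width // 2
    out = []
    for t in range(len(x)):
        frame = list(bias)
        for k in range(width):
            src = t + k - half
            if 0 <= src < len(x):  # positions outside the signal count as zeros
                for i, value in enumerate(x[src]):
                    for o in range(len(bias)):
                        frame[o] += value * kernel[k][i][o]
        out.append(frame)
    return out

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def gated_conv_block(x, kernel, bias):
    """Gated convolution plus residual connection.

    The convolution yields 2*C channels: the first C are the linear part,
    the last C pass through a sigmoid and gate the linear part element-wise.
    The block input is then added to the gated output.
    """
    channels = len(x[0])
    if len(bias) != 2 * channels:
        raise ValueError("convolution must output twice the input channels")
    y = conv1d_same(x, kernel, bias)
    return [
        [x_t[c] + y_t[c] * sigmoid(y_t[channels + c]) for c in range(channels)]
        for x_t, y_t in zip(x, y)
    ]

# Toy input: 4 frames, 1 channel. A zero kernel makes the result depend on the bias only.
x = [[0.0], [1.0], [2.0], [3.0]]
zero_kernel = [[[0.0, 0.0]] for _ in range(3)]              # width 3, 1 in, 2 out
half_open = gated_conv_block(x, zero_kernel, [1.0, 0.0])    # gate = sigmoid(0) = 0.5
closed = gated_conv_block(x, zero_kernel, [1.0, -50.0])     # gate is almost 0
```

When the gate is closed the block reduces to the identity through the residual path, which is one way to see why the residual connection eases gradient flow.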

The gated convolutional layer is also followed by batch normalization, a ReLU activation unit, a dropout layer and a max-pooling layer.

After the gated convolutional neural network, the features on multiple channels are flattened into frequency axis.

Then two fully-connected layers are used to combine extracted features and output nine scores for each segment. Our work differs from others in that we output scores for each segment while most researchers output scores for an audio as a whole. We intend to focus on those segments with more energy and ignore segments with less energy, which we call “silence” segments. That is why we propose a segment-level attention mechanism.

II-E Segment-Level Attention Mechanism

Fig. 4: Segment-Level Attention Mechanism.

As mentioned in Section 1, attention mechanism was introduced to ignore irrelevant sounds such as background noise and silences in audio event classification. In DCASE 2018 task 5, an audio clip labeled as “cooking” may contain some segments of silences and we should not pay too much attention to those segments because audio clips labeled as other classes may also contain silences. Motivated by Xu et al. [8], we propose a segment-level attention mechanism. Our work differs from previous work in that we give our attention weight factors based on the characteristics of segments instead of frames.

The structure of the segment-level attention mechanism is shown in Figure 4. The input of this structure is the aforementioned fbank feature, which is first normalized along the time axis; we call this time normalization. The purpose of time normalization is to further differentiate the features among frames.

A fully-connected layer is added to extract deeper features of frames. Like in the gated convolutional neural network, the fully-connected layer is followed by batch normalization, ReLU and dropout. Next, we calculate the sum along frequency axis. An average pooling layer is added to filter adjacent frames. Then a max-pooling layer is used to maintain the most important information of a segment. Finally, we use a sigmoid activation to limit the weight factors between 0 and 1. Based on our experiments, the duration of a segment is set to 1 second. Specific structure and hyperparameters will be illustrated in Section 3.
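
The chain of operations above can be sketched as follows. This is a simplified illustration, not the authors' implementation: batch normalization and dropout are omitted, time normalization is interpreted here as per-frequency-bin standardization across frames, the pooling windows do not overlap, and the names `time_normalize`, `pool` and `segment_attention` as well as the toy sizes are assumptions.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def time_normalize(fbank, eps=1e-8):
    """Standardize each frequency bin along the time axis (assumed reading)."""
    num_frames = len(fbank)
    num_bins = len(fbank[0])
    means = [sum(frame[b] for frame in fbank) / num_frames for b in range(num_bins)]
    stds = [
        math.sqrt(sum((frame[b] - means[b]) ** 2 for frame in fbank) / num_frames)
        for b in range(num_bins)
    ]
    return [
        [(frame[b] - means[b]) / (stds[b] + eps) for b in range(num_bins)]
        for frame in fbank
    ]

def mean(values):
    return sum(values) / len(values)

def pool(values, size, reduce_fn):
    """Non-overlapping pooling; a shorter final window is kept."""
    return [reduce_fn(values[i:i + size]) for i in range(0, len(values), size)]

def segment_attention(fbank, fc_weights, fc_bias, avg_size, max_size):
    """Return one attention weight in (0, 1) per segment.

    fbank: list of frames, each a list of frequency-bin values.
    fc_weights, fc_bias: parameters of the fully-connected layer (learned in practice).
    """
    per_frame = []
    for frame in time_normalize(fbank):
        # Fully-connected layer over the frequency axis, followed by ReLU.
        hidden = [
            max(0.0, bias + sum(w * v for w, v in zip(row, frame)))
            for row, bias in zip(fc_weights, fc_bias)
        ]
        per_frame.append(sum(hidden))               # sum along the frequency axis
    smoothed = pool(per_frame, avg_size, mean)      # average pooling over adjacent frames
    peaks = pool(smoothed, max_size, max)           # max pooling down to one value per segment
    return [sigmoid(p) for p in peaks]

# Toy input (hypothetical): 8 frames x 2 bins, loud first half and quiet second half.
fbank = [[5.0, 4.0]] * 4 + [[1.0, 0.0]] * 4
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = segment_attention(fbank, identity, [0.0, 0.0], avg_size=2, max_size=2)
```

With these toy settings the eight frames collapse into two segments, and the louder segment receives the larger weight.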

II-F Model Ensemble

Model ensemble is a common strategy in machine learning. In our work, we propose a strategy of model ensemble.

During our experiments, we noticed that “absence”, “other” and “working” are three activities that are often misclassified as one another. So we train a separate model specifically to classify those three classes of activities. When our main system classifies an audio clip as any of the three classes, we use the specially trained model for one more classification.

If an audio clip is classified by our first system as a class other than class 0, 4 or 8 (“absence”, “other” and “working”), that output is the final output. Otherwise, the audio clip is fed into our second system. We denote the output of our first system as $P^1$ and that of the second system as $P^2$. $p_i^N$ represents the output probability of the i-th class by the N-th system, where $i \in \{0, 1, \dots, 8\}$ and N is 1 or 2. The final output of our ensemble system is then calculated according to the following algorithm: we calculate the sum of $p_0^1$, $p_4^1$ and $p_8^1$ and redistribute it among these three classes based on our second system output $P^2$.

  if $\arg\max_i p_i^1 \in \{0, 4, 8\}$ then
    $s \leftarrow p_0^1 + p_4^1 + p_8^1$
    redistribute $s$ over classes 0, 4 and 8 according to $P^2$
  end if
Algorithm 1 Model Ensemble
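
One possible reading of this two-stage rule is sketched below. The exact redistribution formula is not recoverable from the text, so the proportional split used here is an assumption, as are the function name `ensemble`, the dictionary format of the second system's output and the example probabilities.

```python
CONFUSABLE = (0, 4, 8)  # "absence", "other", "working"

def ensemble(p_first, p_second):
    """Combine the main system with the three-class specialist.

    p_first: list of 9 class probabilities from the first system.
    p_second: dict mapping each class in CONFUSABLE to the second system's probability.
    Returns the final list of 9 class probabilities.
    """
    top = max(range(len(p_first)), key=p_first.__getitem__)
    if top not in CONFUSABLE:
        # The first system's decision is kept unchanged.
        return list(p_first)
    # Probability mass the first system assigned to the three confusable classes.
    mass = sum(p_first[i] for i in CONFUSABLE)
    norm = sum(p_second[i] for i in CONFUSABLE)
    final = list(p_first)
    for i in CONFUSABLE:
        # Assumed rule: share the mass in proportion to the specialist's output.
        final[i] = mass * p_second[i] / norm
    return final

# Hypothetical clip: the main system prefers "absence", the specialist prefers "other".
p_first = [0.40, 0.02, 0.02, 0.02, 0.30, 0.02, 0.02, 0.02, 0.18]
p_second = {0: 0.2, 4: 0.7, 8: 0.1}
final = ensemble(p_first, p_second)
final_label = max(range(len(final)), key=final.__getitem__)
```

Under this reading the probabilities of the other six classes are untouched and the total still sums to one, so the result remains a valid distribution.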

III Experiment, Evaluation and Results

III-A Experiment Setup

Our model is trained using Adam [15] for gradient-based optimization, with cross-entropy as the loss function. The structure of our system is shown in Table 2 and Table 3 along with its parameters. The initial learning rate is 0.001 and the batch size is 256 × 4 channels, because each channel is considered a different sample for training. We train the classifiers for 300 epochs.

We select 5% of the testing data as validation dataset and choose models which result in the best accuracy on the validation dataset for final evaluation. In the evaluation process, the outputs of 4-channel acoustics are averaged to get the final posterior probability.

Input 40×501×1                                       Output size
Conv (padding: valid, kernel: [40, 5, 64])           1, 497, 64
BN-ReLU-Dropout(0.2)                                 1, 497, 64
1×5 Max-Pooling (padding: valid)                     1, 99, 64
Gated Conv (padding: same, kernel: [1, 3, 128])      1, 99, 64
BN-ReLU-Dropout(0.2)                                 1, 99, 64
1×10 Max-Pooling (padding: same)                     1, 10, 64
Feature Flattening                                   10, 64
Fully-connected (unit num: 64)-ReLU-Dropout(0.2)     10, 64
Fully-connected (unit num: 9)                        10, 9
TABLE II: Model structure and parameters of gated convolutional neural network
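
The output sizes along the time axis in Table II can be checked with a few lines of arithmetic. The sketch below assumes the usual deep-learning conventions for "valid" and "same" padding, reads the two pooling layers as windows of 5 and 10 frames with a stride equal to the window size, and takes 501 input frames; the helper name `out_len` is hypothetical.

```python
import math

def out_len(n, kernel, stride, padding):
    """Output length of a 1-D convolution or pooling layer."""
    if padding == "valid":
        return (n - kernel) // stride + 1
    if padding == "same":
        return math.ceil(n / stride)
    raise ValueError(f"unknown padding: {padding}")

freq_after_conv = out_len(40, kernel=40, stride=1, padding="valid")     # frequency axis
t_conv = out_len(501, kernel=5, stride=1, padding="valid")              # first convolution
t_pool1 = out_len(t_conv, kernel=5, stride=5, padding="valid")          # first max-pooling
t_gated = out_len(t_pool1, kernel=3, stride=1, padding="same")          # gated convolution
t_pool2 = out_len(t_gated, kernel=10, stride=10, padding="same")        # second max-pooling
print(freq_after_conv, t_conv, t_pool1, t_gated, t_pool2)
```

The final length of 10 corresponds to ten segments per 10-second clip, in line with the 1-second segment duration chosen in Section 2.
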
Input 40×501×1                                       Output size
Fully-connected (unit num: 40)                       40, 501, 1
BN-ReLU-Dropout(0.2)                                 40, 501, 1
Sum along frequency axis                             1, 501, 1
1×5 Average-Pooling (padding: same)                  1, 100, 1
1×10 Max-Pooling (padding: same)                     1, 10, 1
Squeeze                                              10
Sigmoid                                              10
TABLE III: Model structure and parameters of segment-level attention mechanism

III-B Evaluation Metric

The official evaluation metric for DCASE 2018 challenge task 5 is the macro-averaged F1-score. The F1-score is a measure of a test’s accuracy, defined as the harmonic mean of precision and recall. Macro-averaged means that the F1-score is calculated for each class separately and then averaged over all classes. For this task, a full 10-s multi-channel audio clip is considered to be one sample.
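
The metric can be written out in a few lines. The sketch below uses the identity F1 = 2TP / (2TP + FP + FN) per class; scoring a class with no true or predicted samples as 0 is a convention assumed here, and the function name `macro_f1` and the toy labels are illustrative only.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1-scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Harmonic mean of precision and recall, written in terms of counts.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

# Toy labels (hypothetical): class 2 is rare and always missed.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]
score = macro_f1(y_true, y_pred, num_classes=3)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because every class counts equally, the missed rare class pulls the macro-averaged score well below the plain accuracy, which matters for an imbalanced dataset such as the one in Table 1.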

III-C Results

We examine the following configurations:

(1) CNN: Convolutional neural network as baseline system;

(2) SAM-CNN: Convolutional neural network with our proposed segment-level attention mechanism;

(3) GCNN: Gated convolutional neural network;

(4) SAM-GCNN: Gated convolutional neural network with our proposed segment-level attention mechanism;

(5) Ensemble: Gated convolutional neural network with our proposed segment-level attention mechanism and model ensemble.

System Fold1 Fold2 Fold3 Fold4 Average
CNN 81.92% 82.58% 83.26% 87.29% 83.76%
GCNN 85.58% 84.22% 86.36% 88.83% 86.25%
SAM-CNN 83.68% 82.26% 84.56% 88.09% 84.65%
SAM-GCNN 88.49% 86.81% 86.51% 90.52% 88.08%
Ensemble 89.62% 88.11% 87.95% 91.63% 89.33%
TABLE IV: Macro-averaged F1-score of multiple systems on 4 folds

As shown in Table 4, the macro-averaged F1-score of GCNN is 2.49 percentage points higher than that of CNN. Our proposed segment-level attention mechanism improves the classification performance of both CNN and GCNN.

Moreover, our proposed ensemble strategy outperforms the previous systems and achieves an F1-score of 89.33%. The confusion matrices before and after ensemble are shown in Figure 5. On the left is the confusion matrix of SAM-GCNN, and on the right is the confusion matrix of SAM-GCNN with model ensemble. The element in the i-th row and j-th column of this matrix represents the number of audio clips that belong to class i and are classified as class j, so the elements on the diagonal represent the number of correctly classified audio clips. The number of correctly classified audio clips increases after ensemble, especially for “absence”, “other” and “working”, showing that our model ensemble method is effective.

The class-wise performance of our final model is shown in Table 5.

Activity Fold1 Fold2 Fold3 Fold4 Average
Absence 94.43% 92.99% 93.15% 94.93% 93.88%
Cooking 95.92% 94.26% 93.75% 96.49% 95.10%
Dishwashing 87.45% 81.22% 81.81% 83.87% 83.59%
Eating 89.35% 89.66% 87.73% 90.56% 89.33%
Other 52.28% 53.51% 54.61% 67.15% 56.89%
Social activity 97.83% 95.85% 94.38% 98.50% 96.64%
Vacuum cleaning 99.99% 99.81% 100.00% 100.00% 99.95%
Watching TV 99.55% 99.86% 99.42% 99.91% 99.69%
Working 89.82% 85.85% 86.68% 93.22% 88.89%
Macro-Average 89.62% 88.11% 87.95% 91.63% 89.33%
TABLE V: Class-wise performance of proposed model

To better evaluate our work, we compare the performance of the proposed model with the top-2 ranked teams in DCASE 2018 Challenge Task 5 and the official baseline system in Table 6. Both of the top-2 teams adopted complex methods of pre-processing, data augmentation and model ensemble. We achieve comparable performance without any data augmentation, and our system outperforms the official baseline by a clear margin.

Averaged F1-score
Proposed 89.3%
InouetMilk[2] 90.0%
HITfweight[3] 89.8%
Official Baseline 84.5%
TABLE VI: Comparison with state-of-the-art works

Fig. 5: Confusion matrix before and after ensemble on fold4.

IV Conclusion

In this paper, we have presented SAM-GCNN, and the results show that the performance of our proposed system is significantly superior to that of the baseline. Our proposed segment-level attention mechanism improves the performance of both the CNN and GCNN architectures. Furthermore, by using model ensemble, we have achieved competitive performance on the development dataset of DCASE 2018 task 5. Note that both of the top two teams of this task utilized complex methods of data augmentation and model ensemble. Our system achieves comparable performance without data augmentation, which shows that our proposed attention mechanism can contribute substantially to home activity monitoring. Since the ground-truth labels of the evaluation dataset of the DCASE 2018 challenge have not been published yet, further evaluation is left to future work.