There is a rising interest in smart environments that enhance the quality of life for humans in terms of, e.g., safety, security, comfort, and home care.
In order to have smart functionality, situational awareness is required, which can be obtained by interpreting a multitude of sensing modalities, including acoustics. Compared to other modalities, microphone sensors capture highly informative data which can be exploited for multiple purposes. However, many challenges remain regarding the automatic recognition of sounds. In order to properly compare different computational methods, the community needs common publicly available datasets. Previous editions of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge offered a competitive platform to compare computational methods on common datasets for various problems related to the automatic classification and detection of sound events and scenes [4, 5, 6]. This year's challenge, DCASE 2018, consists of five tasks. This paper describes Task 5, which resides in the context of Ambient Assisted Living (AAL), where persons are monitored, e.g. to support patients with a chronic illness and older persons, by tracking the activities they perform at home [3, 8, 9, 10]. When considering an acoustic sensing modality, a domestic activity can be seen as an acoustic scene. An acoustic event is defined as a single consecutive event originating from a single sound source, e.g. a hand clap or a door knock.
The ensemble of multiple events creates an acoustic scene describing a certain environment (e.g. a park or a living room) or, relevant to this task, an activity being performed by a person (e.g. cooking or watching TV). The acoustic sensing literature has mainly covered the problems of automatic classification and detection of sound events and scenes by using spectral information [8, 9, 4]. Likewise, previous DCASE challenges did not focus on letting participants exploit spatial in addition to spectral information. Task 5 offers a multi-channel dataset to compare computational methods that use both types of information. The goal is to exploit spectral and spatial cues, independent of sensor location, using multi-channel audio for the purpose of classifying domestic activities.
This paper presents DCASE 2018 Task 5 in detail. A task definition, information about the dataset, the task setup, baseline system, and baseline results on the development dataset are provided.
2 Task Setup
The goal of this task is to classify multi-channel audio segments (i.e. segmented data is given), acquired by a microphone array, into one of the provided predefined classes as illustrated by Figure 1. These classes are daily activities performed in a home environment (e.g. “Cooking”, “Watching TV” and “Working”).
As they can be composed of different sound events, such activities are considered acoustic scenes. Therefore, this task is quite similar to DCASE 2018 Task 1: Acoustic scene classification. The difference lies in the type of scenes and the possibility to use multi-channel audio.
For the given problem a single person living at home is considered. This reduces the complexity since the number of overlapping activities is expected to be small. In fact, no overlapping activities are present in the considered dataset. These conditions were chosen in order to focus on the main goal of this task, which is to investigate to what extent multi-channel acoustic recordings are beneficial for the purpose of detecting domestic activities. This means that spatial properties can be exploited as input features for the classification problem. However, using estimates of absolute angles or positions of sound sources as input for the detection model is unlikely to generalize to cases where the position of the microphone array is altered. Therefore, this task focuses on systems that can exploit spatial cues independent of sensor position using multi-channel audio.
The DCASE 2018 Challenge consists of five tasks related to the automatic classification and detection of sound events and scenes. All tasks follow the same timeline and similar submission guidelines. Table 1 shows the timeline of the challenge. First, a development dataset was provided along with reference annotations and a baseline system. A month prior to the submission deadline, the evaluation dataset was released. Challenge submissions consist of a system's output on the evaluation dataset, formatted according to the described requirements. In order to compare and understand the submitted systems, participants were also required to submit a technical report describing their system(s) in detail. Reference annotations for the evaluation data were only available to us; therefore, we were responsible for evaluating the results according to the defined metric. These results will be made public after evaluation. Finally, the results will also be presented at the DCASE 2018 Workshop. Participants could optionally submit their technical report as a paper to this workshop.
Release of development dataset: 30 March 2018
Release of baseline system: 16 April 2018
Release of evaluation dataset: 30 June 2018
Challenge submission: 31 July 2018
Publication of results: 15 September 2018
DCASE 2018 Workshop: 19-20 November 2018
2.3 Development and evaluation dataset
The datasets used in this task are a derivative of the SINS dataset. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. Each microphone array consists of four linearly arranged microphones. For this task, the seven microphone arrays in the combined living room and kitchen area are used. More information about the SINS dataset can be found in [12]. Figure 2 shows the floor plan of the recorded environment along with the positions of the microphone arrays used.
For both the development and evaluation datasets, the continuous recordings were split into audio segments of 10 s. This window size was chosen as it is of the same order as the shortest duration of an activity. Each audio segment contains four channels, representing the four microphone channels of a particular microphone array. Segments containing more than one active class (e.g. a transition between two activities) were left out, so that each segment represents a single activity. For the development and evaluation sets, separate time-wise subsets of the SINS dataset were taken. Full sessions of a particular activity were kept together, as shown in Figure 3. Approximately 50% of the data was left out by subsampling to make the dataset easier to use in a challenge.
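The segmentation step described above can be sketched as follows. The helper below is a hypothetical illustration, not the organizers' actual preprocessing code; it assumes the continuous recording is annotated with one activity label per second:

```python
def segment_single_activity(labels, seg_len=10):
    """Cut a list of per-second activity labels into non-overlapping
    segments of seg_len seconds, keeping only segments during which a
    single activity is active; transition segments are discarded."""
    segments = []
    for start in range(0, len(labels) - seg_len + 1, seg_len):
        window = labels[start:start + seg_len]
        if len(set(window)) == 1:  # exactly one active class
            segments.append((start, start + seg_len, window[0]))
    return segments

# toy example: 25 s of labels with one activity transition
labels = ["Cooking"] * 14 + ["Watching TV"] * 11
print(segment_single_activity(labels))
# → [(0, 10, 'Cooking')]: the 10-20 s window spans the transition and is
#   dropped; the tail (20-25 s) is too short for a full segment
```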
For the development dataset we provided the data along with reference annotations. The provided data was recorded by four out of the seven microphone arrays, given the time-wise subset. The exact positions of these microphone arrays were not made available and should not be exploited. We also provided a cross-validation setup containing four folds. Participants were encouraged to use these folds so that results reported on the development set are comparable, but using them was not mandatory. The nine daily activities are shown in Table 2, along with the number of 10-second multi-channel segments and the number of full sessions of a certain activity (e.g. a cooking session). Compared to the SINS dataset, we have combined the activities "Phone call" and "Visit" into one activity named "Social activity". The class "Absence" refers to the person not being present in the room. However, the data does contain recordings in which the person is present in another room; depending on the activity, this is noticeable in the recording. The class "Other" refers to the person being present in the room but not performing any other activity in the list. All other activities are self-explanatory. Note that the dataset is unbalanced.
The evaluation dataset was acquired in a similar manner as the development dataset and has a similar class distribution. More statistics about this dataset will be made available after the task is finished. The dataset was provided as audio only. It contains data recorded by all seven microphone arrays, on a different time-wise subset than the development set. The final evaluation will be based on the data obtained by the microphone arrays not present in the development set. The other microphone arrays are provided to give insight into the amount of overfitting to those sensor positions.
Participants were allowed to use external data (including pre-trained models) and data augmentation for system development. It was not allowed to use the evaluation data to train the submitted system in an (un)supervised manner.
2.4 Baseline system
The baseline system is intended to lower the barrier to participating in the task and to provide a reference performance. The system has all the functionality for dataset handling, calculating features, training models, and evaluating the results. The system is implemented in Python, primarily using the DCASE UTIL library [13] and Keras [14] for learning.
The baseline system trains a single classifier model that takes a single channel as input. During the recording campaign, data was measured simultaneously using multiple microphone arrays, each containing 4 microphones. Hence, each domestic activity was recorded as many times as there were microphones. Each parallel recording of a single activity is considered a different example during training. The learner in the baseline system is based on a Neural Network architecture using two convolutional layers and one dense layer. As input, log mel-band energies are provided to the network. The features extracted from each microphone channel are treated as separate examples. In the prediction stage, a single outcome is computed for each microphone array by averaging the 4 model outcomes (posteriors) obtained by evaluating the trained classifier model on each of the 4 microphones. The features are calculated in frames of 40 ms with 50% overlap, using 40 mel bands covering a frequency range from 50 to 8000 Hz. An overview of the Neural Network architecture is shown in Figure 4. It uses an input size of 40x501, i.e. the log mel-band energies of all frames within a 10 s segment. The first convolutional layer has 32 filters with a kernel size of 40x5 and a stride of one; the convolution is therefore only performed over the time axis. Subsampling is then performed by max pooling with a factor of 5. The resulting feature map of 32x99 is provided to a second convolutional layer with 64 filters, a kernel size of 32x3, and a stride of one. Subsequently, this is subsampled using max pooling with a factor of 3. After each convolutional layer, Batch Normalization and ReLU activation are used. The resulting output, a feature vector of 64 coefficients, is then provided to a Fully Connected (FC) layer of 64 neurons with ReLU activation. The output layer consists of 9 neurons, representing the output classes, with Softmax activation. For regularization, Dropout (20%) is used between each layer. The network is trained using the Adam optimizer with a learning rate of 0.0001. The batch size is 256 segments, which results in a total of 1024 examples provided to the learner, given that each segment contains 4 channels of audio. In each epoch, the training dataset is randomly subsampled so that the number of examples for each class matches the size of the smallest class. The performance of the model is evaluated every 10 epochs, out of 500 in total, on a validation subset (30% subsampled from the training set). The model with the highest validation score is used as the final model. As metric, the macro-averaged F1-score is used, which is the mean of the class-wise F1-scores.
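The stated feature-map sizes can be verified with simple shape arithmetic. The sketch below assumes 'valid' convolutions and non-overlapping pooling; the final reduction from the remaining time steps to a single 64-coefficient vector is assumed here to be a pooling over the time axis, a step the description does not state explicitly:

```python
def conv_out(n, kernel, stride=1):
    # output length of a 'valid' convolution along one axis
    return (n - kernel) // stride + 1

def pool_out(n, factor):
    # output length of non-overlapping pooling
    return n // factor

# 10 s segment, 40 ms frames with 50% overlap -> 20 ms hop -> 501 frames
n_frames = 10_000 // 20 + 1
assert n_frames == 501

t = conv_out(n_frames, 5)  # conv1: kernel 40x5 collapses the 40 mel bands
assert t == 497
t = pool_out(t, 5)         # max pooling, factor 5
assert t == 99             # -> feature map of 32x99
t = conv_out(t, 3)         # conv2: kernel 32x3 spans all 32 input maps
assert t == 97
t = pool_out(t, 3)         # max pooling, factor 3
assert t == 32             # 32 time steps of 64 coefficients remain
```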
Activity | F1-score dev. | F1-score eval.
3 Baseline system results
Table 3 presents the results for the baseline system on the development set using the provided cross-validation folds. The performance on the evaluation set will be released when the results of the task are made public. Regarding the results on the development set, the system was trained and tested five times to provide an estimate of the variance related to random weight initialization and the random validation split. On average, the model achieves a performance of 84.50%. Class-wise performances vary from 44.76% up to 99.59%. The worst performing classes are Other and Dishwashing, while the best performing ones are Vacuum cleaning, Watching TV, and Cooking.
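For reference, the reported metric can be computed as follows; the class-wise F1-scores used here are illustrative values, not the actual task results:

```python
def f1_score(precision, recall):
    # harmonic mean of precision and recall for one class
    return 2 * precision * recall / (precision + recall)

def macro_f1(class_f1_scores):
    # macro-average: unweighted mean of the class-wise F1-scores, so
    # every class counts equally despite the unbalanced dataset
    return sum(class_f1_scores) / len(class_f1_scores)

# illustrative class-wise F1-scores for the nine activities
scores = [0.45, 0.99, 0.85, 0.90, 0.80, 0.95, 0.88, 0.70, 0.92]
print(round(macro_f1(scores), 4))  # → 0.8267
```

Because the macro-average weights each class equally, poorly performing small classes such as Other pull the overall score down more than their share of the data would suggest.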
In this paper we introduced the setup of DCASE 2018 Task 5, a task primarily concerned with systems exploiting both spectral and spatial information, independent of sensor location, using multi-channel audio. The dataset used for the task offers recordings of domestic activities in a single home environment [15, 16]. A baseline system was introduced, based on a Neural Network architecture using two convolutional layers and a dense layer. Results were reported on this dataset using the provided publicly available baseline system [17], showing a macro-averaged F1-score of 84.5%.
More information on the evaluation dataset and the performance will be made available when the task results are made public.
Thanks to Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon and Bertold Van den Bergh for their contribution to acquiring the SINS dataset.
-  SINS. Sound INterfacing through the Swarm. [Online]. Available: http://www.esat.kuleuven.be/sins/
-  F. Erden, S. Velipasalar, A. Z. Alkar, and A. E. Cetin, “Sensors in assisted living: A survey of signal and image processing methods,” IEEE Signal Processing Magazine, vol. 33, no. 2, pp. 36–44, March 2016.
-  M. Vacher, F. Portet, A. Fleury, and N. Noury, “Development of audio sensing technology for ambient assisted living: Applications and challenges,” International Journal of E-Health and Medical Communications (IJEHMC), vol. 2, no. 1, pp. 35–54, January 2011.
-  A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 Challenge setup: Tasks, datasets and baseline system,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany, November 2017.
-  D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, “Detection and classification of acoustic scenes and events,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, Oct 2015.
-  A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley, “Detection and classification of acoustic scenes and events: Outcome of the dcase 2016 challenge,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 26, no. 2, pp. 379–393, Feb. 2018. [Online]. Available: https://doi.org/10.1109/TASLP.2017.2778423
-  DCASE. (2018) DCASE2018 Challenge. [Online]. Available: http://dcase.community/challenge2018/
-  L. Vuegen, B. Van Den Broeck, P. Karsmakers, H. Van hamme, and B. Vanrumste, “Automatic monitoring of activities of daily living based on real-life acoustic sensor data: A preliminary study,” in Proc. Fourth workshop on speech and language processing for assistive technologies (SLPAT), 2013, pp. 113–118.
-  ——, “Energy efficient monitoring of activities of daily living using wireless acoustic sensor networks in clean and noisy conditions,” in 2015 23rd European Signal Processing Conference (EUSIPCO), Aug 2015, pp. 449–453.
-  M. Vacher, B. Lecouteux, P. Chahuara, F. Portet, B. Meillon, and N. Bonnefond, “The Sweet-Home speech and multimodal corpus for home automation interaction,” in The 9th edition of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland, 2014, pp. 4499–4506. [Online]. Available: http://hal.archives-ouvertes.fr/hal-00953006
-  DCASE and KU Leuven. (2018) DCASE2018 Task 5: Monitoring of domestic activities based on multi-channel acoustics. [Online]. Available: http://dcase.community/challenge2018/task-monitoring-domestic-activities
-  G. Dekkers, S. Lauwereins, B. Thoen, M. W. Adhana, H. Brouckxon, T. van Waterschoot, B. Vanrumste, M. Verhelst, and P. Karsmakers, “The SINS database for detection of daily activities in a home environment using an acoustic sensor network,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany, November 2017, pp. 32–36.
-  Toni Heittola. (2018) DCASE UTIL: Utilities for Detection and Classification of Acoustic Scenes. [Online]. Available: https://dcase-repo.github.io/dcase_util/
-  Keras. (2018) Keras: The Python Deep Learning library. [Online]. Available: https://keras.io/
-  G. Dekkers and P. Karsmakers. (2018) DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Development dataset. [Online]. Available: https://zenodo.org/record/1247102
-  ——. (2018) DCASE 2018, Task 5: Monitoring of domestic activities based on multi-channel acoustics - Evaluation dataset. [Online]. Available: https://zenodo.org/record/1291760
-  G. Dekkers, T. Heittola, and P. Karsmakers. (2018) DCASE2018 Task 5: system baseline. [Online]. Available: https://github.com/DCASE-REPO/dcase2018_baseline/tree/master/task5