Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, including artificial intelligence (AI)-based factory automation. Prompt detection of machine anomalies by observing the sounds a machine emits may be useful for machine condition monitoring. To connect Detection and Classification of Acoustic Scenes and Events (DCASE) challenge tasks with real-world problems, we organize a new DCASE task, “unsupervised-ASD”.
The main challenge of this task is to detect unknown anomalous sounds under the condition that only normal sound samples have been provided as training data [1, 2, 3, 4, 5, 6]. In real-world factories, actual anomalous sounds rarely occur and are highly diverse. Therefore, exhaustive patterns of anomalous sounds are impossible to deliberately make and/or collect. This means we have to detect unknown anomalous sounds that were not observed in the given training data. This is one of the major differences in premise between ASD for industrial equipment and the past supervised DCASE challenge tasks for detecting defined anomalous sounds such as gunshots or a baby crying [7].
The importance of detecting unknown anomalies has been recognized for a long time, and various approaches have been investigated [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. In early studies, acoustic features for detecting anomalies were designed based on the mechanical structure of the target machine [18, 19, 20]. More recently, deep-learning-based methods have also been investigated [21, 22, 23, 24, 25, 26, 27, 28]. However, although recent studies have published large-scale datasets for ASD [29, 30, 31], many of these methods have been evaluated with different datasets and metrics, which makes it difficult to compare their effectiveness and characteristics fairly. We believe that creating a benchmark for ASD by designing a unified dataset and metrics would contribute both to accelerating research in this area and to the industrial use of the latest technologies.
We have designed a DCASE challenge task that serves as a starting point and a benchmark for ASD research. The dataset, evaluation metrics, a simple baseline system, and other detailed rules are designed so that they do not deviate from real-world issues.
2 Unsupervised anomalous sound detection
Let the $L$-point-long time-domain observation $\boldsymbol{x} \in \mathbb{R}^L$ be an observation that includes a sound emitted from the target machine. ASD is an identification problem of determining whether the state of the target machine is normal or anomalous from $\boldsymbol{x}$.
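To estimate the state of the target, the anomaly score is calculated. Here, the anomaly score takes a large value when the input signal seems to be anomalous, and vice versa. To calculate the anomaly score, we construct an anomaly score calculator $\mathcal{A}$ with parameter $\theta$. Then, the target is determined to be anomalous when the anomaly score $\mathcal{A}_\theta(\boldsymbol{x})$ exceeds the pre-defined threshold value $\phi$ as

$$\text{Decision} = \begin{cases} \text{Anomaly} & (\mathcal{A}_\theta(\boldsymbol{x}) > \phi) \\ \text{Normal} & (\text{otherwise}). \end{cases} \tag{1}$$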
As is clear from (1), we need to design $\mathcal{A}_\theta$ so that $\mathcal{A}_\theta(\boldsymbol{x})$ takes a large value when the audio clip $\boldsymbol{x}$ is anomalous. Intuitively, this seems to be a design problem of a classifier for a two-class classification problem. However, this task cannot be solved as a simple classification problem, because only normal sound samples have been provided as training data in the unsupervised-ASD scenario. Thus, the main research question of this task is: how can anomalies be detected without anomalous training data?
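The decision rule in (1) can be illustrated with a minimal Python sketch. The names `detect` and `toy_score` are hypothetical, and the toy scorer (mean absolute amplitude) is only a stand-in for a trained anomaly score calculator; it is not the method used in this task.

```python
from typing import Callable, Sequence


def detect(x: Sequence[float],
           score_fn: Callable[[Sequence[float]], float],
           threshold: float) -> str:
    """Apply the thresholding rule: anomalous if the score exceeds the threshold."""
    return "anomaly" if score_fn(x) > threshold else "normal"


def toy_score(x: Sequence[float]) -> float:
    """Hypothetical stand-in scorer: mean absolute amplitude of the clip."""
    return sum(abs(v) for v in x) / len(x)


if __name__ == "__main__":
    quiet_clip = [0.1, -0.1, 0.05]
    loud_clip = [1.0, -2.0, 1.5]
    print(detect(quiet_clip, toy_score, threshold=0.5))
    print(detect(loud_clip, toy_score, threshold=0.5))
```

In practice the threshold would have to be chosen from normal data alone, since no anomalous samples are available for training.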
3 Task setup
The data used for this task comprises parts of ToyADMOS [30] and the MIMII Dataset [31], consisting of the normal/anomalous operating sounds of six types of toy/real machines. The anomalous sounds in these datasets were collected by deliberately damaging target machines. The following six types of toy/real machines are used in this task: Toy-car and Toy-conveyor from ToyADMOS, and Valve, Pump, Fan, and Slide rail from MIMII Dataset. To simplify the task, we used only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings of a fixed microphone. Each recording is a single-channel audio clip of approximately 10 seconds that includes both a target machine’s operating sound and environmental noise. All signals have been downsampled to a sampling rate of 16 kHz. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. For the details of the recording procedure, please refer to the papers of ToyADMOS and MIMII Dataset.
In this task, we define two important terms: Machine Type and Machine ID. Machine Type means the kind of machine, which in this task is one of six: toy-car, toy-conveyor, valve, pump, fan, and slide rail. Machine ID is the identifier of each individual machine of the same type; each Machine Type has three or four Machine IDs in the training dataset and three in the test dataset.
The development dataset includes (i) around 1,000 samples of normal sounds for training and (ii) 100–200 samples each of normal and anomalous sounds for testing, for each Machine Type and Machine ID. The evaluation dataset consists of around 400 test samples for each Machine Type and Machine ID, none of which have a condition label (i.e., normal or anomaly). Note that the Machine IDs of the evaluation dataset are different from those of the development dataset. Thus, we also provide an additional training dataset that includes around 1,000 normal samples for each Machine Type and Machine ID used in the evaluation dataset.
3.2 Evaluation metrics
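This task is evaluated with the area under the receiver operating characteristic (ROC) curve (AUC) and the partial AUC (pAUC). The pAUC is an AUC calculated from a portion of the ROC curve over a pre-specified range of interest. In our metric, the pAUC is calculated as the AUC over a low false-positive-rate (FPR) range $[0, p]$. The AUC and pAUC are defined as

$$\mathrm{AUC} = \frac{1}{N_- N_+} \sum_{i=1}^{N_-} \sum_{j=1}^{N_+} \mathcal{H}\left(\mathcal{A}_\theta(\boldsymbol{x}_j^+) - \mathcal{A}_\theta(\boldsymbol{x}_i^-)\right), \tag{2}$$

$$\mathrm{pAUC} = \frac{1}{\lfloor p N_- \rfloor N_+} \sum_{i=1}^{\lfloor p N_- \rfloor} \sum_{j=1}^{N_+} \mathcal{H}\left(\mathcal{A}_\theta(\boldsymbol{x}_j^+) - \mathcal{A}_\theta(\boldsymbol{x}_i^-)\right), \tag{3}$$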
where $\lfloor \cdot \rfloor$ is the flooring function and $\mathcal{H}(x)$ is the hard-threshold function, which returns 1 when $x > 0$ and 0 otherwise. Here, $\{\boldsymbol{x}_i^-\}_{i=1}^{N_-}$ and $\{\boldsymbol{x}_j^+\}_{j=1}^{N_+}$ are normal and anomalous test samples, respectively, and have been sorted so that their anomaly scores are in descending order. $N_-$ and $N_+$ are the numbers of normal and anomalous test samples, respectively.
The additional use of the pAUC is motivated by practical requirements. If an ASD system gives false alerts frequently, we cannot trust it, just as “the boy who cried wolf” could not be trusted. Therefore, it is especially important to increase the true-positive rate under low FPR conditions. In this task, we will use $p = 0.1$.
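The two metrics can be sketched in plain Python as pairwise comparisons between anomalous and normal scores, with the pAUC restricted to the highest-scoring fraction of normal samples. This is a minimal illustration rather than the official evaluation code; the function names `auc` and `pauc` and the example scores are made up for this sketch.

```python
import math
from typing import Sequence


def auc(normal_scores: Sequence[float], anomalous_scores: Sequence[float]) -> float:
    """Fraction of (normal, anomalous) pairs where the anomalous score is strictly larger."""
    hits = sum(1 for s_neg in normal_scores
               for s_pos in anomalous_scores if s_pos > s_neg)
    return hits / (len(normal_scores) * len(anomalous_scores))


def pauc(normal_scores: Sequence[float], anomalous_scores: Sequence[float],
         p: float) -> float:
    """AUC restricted to the floor(p * N_normal) normal samples with the highest scores."""
    n_top = math.floor(p * len(normal_scores))
    if n_top < 1:
        raise ValueError("p is too small for the number of normal samples")
    top_normals = sorted(normal_scores, reverse=True)[:n_top]
    return auc(top_normals, anomalous_scores)


if __name__ == "__main__":
    normal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    anomalous = [0.95, 1.5]
    print(auc(normal, anomalous))
    print(pauc(normal, anomalous, p=0.1))
```

In this toy example the overall AUC is high, but the pAUC is much lower because one anomalous sample scores below the highest-scoring normal sample, which is exactly the low-FPR behaviour the pAUC is meant to expose.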
3.3 Baseline system and results
The baseline system is a simple autoencoder (AE)-based anomaly score calculator which is used in several conventional studies. The anomaly score is calculated as the reconstruction error of the observed sound. To obtain small anomaly scores for normal sounds, the AE is trained to minimize the reconstruction error of the normal training data. This method is based on the assumption that the AE cannot reconstruct sounds that are not used in training, that is, unknown anomalous sounds.
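In the baseline system, we first calculate a log-mel-spectrogram of the input, $\boldsymbol{X} = \{\boldsymbol{X}_t\}_{t=1}^{T}$ with $\boldsymbol{X}_t \in \mathbb{R}^F$, where $F$ and $T$ are the number of mel-filters and time-frames, respectively. Then, the log-mel spectrum at $t$ is concatenated with the $P$ frames before and after it as $\boldsymbol{\psi}_t = (\boldsymbol{X}_{t-P}, \ldots, \boldsymbol{X}_{t+P})$, which is used as the acoustic feature at $t$. Then, the anomaly score is calculated as

$$\mathcal{A}_\theta(\boldsymbol{x}) = \frac{1}{T} \sum_{t=1}^{T} \left\| \boldsymbol{\psi}_t - \mathcal{D}_{\theta_D}\left(\mathcal{E}_{\theta_E}(\boldsymbol{\psi}_t)\right) \right\|_2^2, \tag{4}$$

where $\|\cdot\|_2$ is the $\ell_2$ norm, and $\mathcal{E}$ and $\mathcal{D}$ are the encoder and decoder of the AE whose parameters are $\theta_E$ and $\theta_D$, respectively. Thus, $\theta = \{\theta_E, \theta_D\}$.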
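The feature construction and scoring steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the baseline implementation: the names `context_features` and `anomaly_score`, the symbol `P` for the number of context frames on each side, the use of the squared error, and the choice to skip edge frames that lack full context are all made for this sketch. The `reconstruct` argument is a hypothetical stand-in for the trained encoder-decoder pair.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]


def context_features(X: Sequence[Frame], P: int) -> List[List[float]]:
    """Concatenate each frame with its P preceding and P following frames.

    Assumption: frames without full context at the clip edges are skipped.
    """
    T = len(X)
    return [[v for frame in X[t - P:t + P + 1] for v in frame]
            for t in range(P, T - P)]


def anomaly_score(X: Sequence[Frame], P: int,
                  reconstruct: Callable[[List[float]], Sequence[float]]) -> float:
    """Mean squared reconstruction error over all context-window features."""
    feats = context_features(X, P)
    errors = [sum((a - b) ** 2 for a, b in zip(psi, reconstruct(psi)))
              for psi in feats]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Toy "spectrogram": T = 5 frames, F = 2 mel bins.
    X = [[float(t)] * 2 for t in range(5)]
    print(anomaly_score(X, P=1, reconstruct=lambda psi: psi))               # perfect reconstruction
    print(anomaly_score(X, P=1, reconstruct=lambda psi: [0.0] * len(psi)))  # all-zero output
```

A model that reproduces its input exactly yields a score of zero, while a poor reconstruction yields a large score, which is the behaviour the baseline relies on to separate normal from anomalous sounds.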
The hyper-parameters of the baseline system are as follows. The encoder and the decoder of the AE each consist of one input fully-connected-neural-network (FCN) layer, three hidden FCN layers, and one output FCN layer. Each hidden layer has 128 hidden units, and the dimension of the encoder output is 8. The rectified linear unit (ReLU) is used after each FCN layer except the output layer of the decoder. We stopped the training process after 100 epochs, and the batch size was 512. The Adam optimizer was used, and we fixed the learning rate at 0.001.
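To make the layer layout concrete, the following sketch lists the layer widths implied by this description and counts the resulting weights and biases. It reflects one reading of the text (input layer, three hidden layers, and output layer in each half); the input dimension of 640 is an assumption chosen only for illustration, and the helper names `layer_dims` and `count_parameters` are hypothetical.

```python
from typing import List, Tuple


def layer_dims(input_dim: int, hidden_units: int = 128,
               n_hidden: int = 3, bottleneck: int = 8) -> Tuple[List[int], List[int]]:
    """Return the unit counts along the encoder and the decoder.

    Each half has one input layer, n_hidden hidden layers, and one output layer.
    """
    encoder = [input_dim] + [hidden_units] * (n_hidden + 1) + [bottleneck]
    decoder = [bottleneck] + [hidden_units] * (n_hidden + 1) + [input_dim]
    return encoder, decoder


def count_parameters(dims: List[int]) -> int:
    """Weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(dims[:-1], dims[1:]))


if __name__ == "__main__":
    # Assumed input dimension (not stated in the text above).
    enc, dec = layer_dims(input_dim=640)
    print(enc, dec)
    print(count_parameters(enc) + count_parameters(dec))
```

Under these assumptions the model is small, which is consistent with its role as a simple baseline.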
Results of the baseline system are presented in Table 1. The AUC and pAUC on the development dataset were evaluated using several types of GPUs (RTX 2080, etc.). Because the results produced with a GPU are generally non-deterministic, the average and standard deviation from 10 independent trials (training and testing) are shown in the table.
4 Challenge results
Challenge results and analysis of the submitted systems will be added for the official submission of the paper to the DCASE 2020 Workshop.
This paper presented an overview of the task and analysis of the solutions submitted to DCASE 2020 Challenge Task 2. Challenge results and analysis of the submitted systems will be added for the official submission of the paper to the DCASE 2020 Workshop.
-  Y. Koizumi, S. Saito, H. Uematsu, and N. Harada, “Optimizing Acoustic Feature Extractor for Anomalous Sound Detection Based on Neyman-Pearson Lemma,” Proc. of Eur. Signal Process. Conf. (EUSIPCO), 2017.
-  Y. Kawaguchi and T. Endo, “How Can We Detect Anomalies from Subsampled Audio Signals?,” Proc. of IEEE Int’l Workshop on Machine Learning for Signal Process. (MLSP), 2017.
-  Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, “Unsupervised Detection of Anomalous Sound based on Deep Learning and the Neyman-Pearson Lemma,” IEEE/ACM Trans. on Audio Speech and Language Processing, 2019.
-  Y. Kawaguchi, R. Tanabe, T. Endo, K. Ichige, and K. Hamada, “Anomaly Detection Based on an Ensemble of Dereverberation and Anomalous Sound Extraction,” Proc. of Int’l Conf. on Acous., Speech, and Signal Process. (ICASSP), 2019.
-  Y. Koizumi, S. Saito, M. Yamaguchi, S. Murata, and N. Harada, “Batch Uniformization for Minimizing Maximum Anomaly Score of DNN-based Anomaly Detection in Sounds,” Proc. of the Workshop on Appl. of Signal Process. to Audio and Acoust. (WASPAA), 2019.
-  K. Suefusa, T. Nishida, H. Purohit, R. Tanabe, T. Endo, and Y. Kawaguchi, “Anomalous Sound Detection Based on Interpolation Deep Neural Network,” Proc. of Int’l Conf. on Acous., Speech, and Signal Process. (ICASSP), 2020.
-  A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System,” Proc. of DCASE, 2017.
-  A. Ito, A. Aiba, M. Ito and S. Makino, “Detection of Abnormal Sound using Multi-Stage GMM for Surveillance Microphone,” Proc. Int’l Conf. Info. Assurance and Security, 2009.
-  C. F. Chan and W. M. Eric, “An Abnormal Sound Detection and Classification System for Surveillance Applications,” Proc. of Eur. Signal Process. Conf. (EUSIPCO), 2010.
-  C. N. Doukas and I. Maglogiannis, “Emergency Fall Incidents Detection in Assisted Living Environments Utilizing Motion, Sound, and Visual Perceptual Components,” IEEE Trans. Inf. Technol. Biomed., 2011.
-  S. Ntalampiras, I. Potamitis, and N. Fakotakis, “Probabilistic Novelty Detection for Acoustic Surveillance Under Real-World Conditions,” IEEE Trans. on Multimedia, pp. 713–719, 2011.
-  S. Lecomte, R. Lengelle, C. Richard, F. Capman, and B. Ravera, “Abnormal Events Detection using Unsupervised One-Class SVM: Application to audio surveillance and evaluation,” Proc. of Int’l Conf. Advanced Video and Signal Based Surveillance, 2011.
-  D. Conte, P. Foggia, G. Percannella, A. Saggese and M. Vento, “An Ensemble of Rejecting Classifiers for Anomaly Detection of Audio Events,” Proc. of Int’l Conf. Advanced Video and Signal-Based Surveillance (AVSS), 2012.
-  R. Bardeli and D. Stein, “Uninformed Abnormal Event Detection on Audio,” Proc. of ITG Symposium on Speech Communication, 2012.
-  F. Aurino, M. Folla, F. Gargiulo, V. Moscato, A. Picariello, C. Sansone, “One-Class SVM Based Approach for Detecting Anomalous Audio Events,” Proc. of Int’l Conf. on Intelligent Networking and Collaborative Systems (INCoS), 2014.
-  Y. Kawaguchi, T. Endo, K. Ichige, and K. Hamada, “Non-Negative Novelty Extraction: A New Non-Negativity Constraint for NMF,” Proc. of Int’l Workshop on Acoust. Signal Enh. (IWAENC), 2018.
-  R. Giri, A. Krishnaswamy, K. Helwani, “Robust Non-Negative Block Sparse Coding for Acoustic Novelty Detection,” Proc. of DCASE, 2019.
-  M. Holguín-Londoño, O. Cardona-Morales, E. F. Sierra-Alonso, J. D. Mejia-Henao, Á. Orozco-Gutiérrez, and G. Castellanos-Dominguez, “Machine Fault Detection Based on Filter Bank Similarity Features Using Acoustic and Vibration Analysis,” Math. Probl. Eng., 2016.
-  K. Minemura, T. Ogawa, T. Kobayashi, “Acoustic Feature Representation Based on Timbre for Fault Detection of Rotary Machines,” Proc. of 2018 Int’l Conf. Sensing, Diagnostics, Prognostics, and Control (SDPC), 2018.
-  Y. Wei, Y. Li, M. Xu, and W. Huang, “A Review of Early Fault Diagnosis Approaches and Their Applications in Rotating Machinery,” Entropy, vol. 21, no. 4, 2019.
-  E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, “A Novel Approach for Automatic Acoustic Novelty Detection using a Denoising Autoencoder with Bidirectional LSTM Neural Networks,” Proc. of Int’l Conf. on Acous., Speech, and Signal Process. (ICASSP), 2015.
-  E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, “Non-Linear Prediction with LSTM Recurrent Neural Networks for Acoustic Novelty Detection,” Proc. of Int’l Joint Conf. on Neural Net. (IJCNN), 2015.
-  D. Y. Oh, I. D. Yun, “Residual Error Based Anomaly Detection Using Auto-Encoder in SMD Machine Sound,” Sensors, vol. 18, no. 5, 2018.
-  Y. Kawachi, Y. Koizumi, and N. Harada, “Complementary Set Variational Autoencoder for Supervised Anomaly Detection,” Proc. of Int’l Conf. on Acous., Speech, and Signal Process. (ICASSP), 2018.
-  Y. Kawachi, Y. Koizumi, S. Murata and N. Harada, “A Two-Class Hyper-Spherical Autoencoder for Supervised Anomaly Detection,” Proc. of Int’l Conf. on Acous., Speech, and Signal Process. (ICASSP), 2019.
-  M. Yamaguchi, Y. Koizumi, and N. Harada, “AdaFlow: Domain-Adaptive Density Estimator with Application to Anomaly Detection and Unpaired Cross-Domain Transition,” Proc. of Int’l Conf. on Acous., Speech, and Signal Process. (ICASSP), 2019.
-  Y. Koizumi, S. Murata, N. Harada, S. Saito, and H. Uematsu, “SNIPER: Few-shot Learning for Anomaly Detection to Minimize False-Negative Rate with Ensured True-Positive Rate,” Proc. of Int’l Conf. on Acous., Speech, and Signal Process. (ICASSP), 2019.
-  Y. Koizumi, M. Yasuda, S. Murata, S. Saito, and N. Harada, “SPIDERnet: Attention Network for One-shot Anomaly Detection in Sounds,” Proc. of Int’l Conf. on Acous., Speech, and Signal Process. (ICASSP), 2020.
-  S. Grollmisch, J. Abeßer, J. Liebetrau, and H. Lukashevich, “Sounding Industry: Challenges and Datasets for Industrial Sound Analysis,” Proc. of Eur. Signal Process. Conf. (EUSIPCO), 2019.
-  Y. Koizumi, S. Saito, H. Uematsu, N. Harada, and K. Imoto, “ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection,” Proc. of the Workshop on Appl. of Signal Process. to Audio and Acoust. (WASPAA), 2019.
-  H. Purohit, R. Tanabe, K. Ichige, T. Endo, Y. Nikaido, K. Suefusa, and Y. Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” Proc. of DCASE, 2019.