Since anomalies may indicate faults or malicious activities, promptly detecting them can prevent such problems. Microphones have been used as sensors to detect anomalies, referred to as anomaly detection in sounds (ADS) or acoustic condition monitoring, in many applications such as audio surveillance [3, 4, 5, 6], machine-condition inspection, and fault diagnosis [7, 8, 9]. A recent advancement in this area is the use of deep learning [10, 11, 12, 13, 14, 15]: an autoencoder (AE) [10, 11, 12], variational AE [13, 14], and/or flow-based model are used to calculate the anomaly score.
A large-scale dataset is essential for successfully training and fairly evaluating a deep neural network (DNN)-based system. Therefore, the existence of freely available large-scale datasets often accelerates related research in this domain. For example, the accuracy of computer-vision tasks has rapidly increased thanks to large-scale datasets such as ImageNet. Large-scale datasets have also contributed to recent advancements in speech and acoustic signal processing, such as the Wall Street Journal (WSJ0) speech corpus for automatic speech recognition and the VCTK corpus for text-to-speech synthesis. In audio-event detection and scene-classification tasks, two large-scale datasets, AudioSet and the Freesound dataset, have been published and used in several tasks of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge [21, 22].
Unfortunately, to the best of our knowledge, no large-scale datasets are freely available for ADS. One reason is that anomalous sounds occur far more rarely than normal sounds and are therefore difficult to collect. Consequently, surveillance tasks, such as gunshot detection, are trained and evaluated on small-scale datasets [23, 24]. For machine-condition inspection and fault-diagnosis tasks, even small datasets are scarce. Thus, ADS systems have been evaluated by using a synthetic anomalous sound dataset instead of collecting anomalous sounds by deliberately damaging expensive machinery. To fairly evaluate systems for anomaly detection in machine operating sounds (ADMOS), we believe that a freely available dataset is necessary.
This paper introduces a new dataset called “ToyADMOS” designed for training and testing ADMOS systems. We collected normal and anomalous operating sounds of miniature machines by deliberately damaging their components. Since miniature machines can be installed in an acoustic laboratory, recording conditions can be controlled. With this advantage, we designed the ToyADMOS dataset to be used not only for basic unsupervised ADMOS but also for multiple advanced tasks such as domain adaptation, noise reduction, data augmentation, and few-shot learning of anomalous sounds. The ToyADMOS dataset has the following characteristics:
It is designed for three ADMOS tasks: product inspection (toy car), fault diagnosis for a fixed machine (toy conveyor), and fault diagnosis for a moving machine (toy train).
Machine-operating sounds and environmental noise are individually recorded so that various noise levels can be simulated.
All sounds are recorded with four microphones for testing noise-reduction and/or data-augmentation techniques such as mix-up.
In each task, multiple machines of the same class are used; each machine belongs to the same class of toys but has a different detailed structure (cf. Sec 3 and the supplemental document available on the dataset webpage). Since the collected operating sounds vary with individual differences, the dataset can be used for testing domain-adaptation techniques that absorb individual differences and/or changes in noise level.
Each anomalous sound was recorded several times for testing few-shot-learning-based ADMOS, which obtains the characteristics of anomalous sounds from only a few samples.
The released dataset consists of over 180 hours of normal machine-operating sounds and over 4,000 samples of anomalous sounds collected with four microphones at a 48-kHz sampling rate for each task.
This dataset is freely available for download at https://github.com/YumaKoizumi/ToyADMOS-dataset; the license and related information are also available on that webpage.
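Among the techniques mentioned above, mix-up linearly combines two training samples with a random weight. A minimal numpy sketch, assuming two equal-length waveforms (the function name and the Beta parameter are illustrative, not part of the dataset's tutorial code):

```python
import numpy as np

def mixup(x1, x2, alpha=0.2, rng=None):
    """Mix two waveforms with a weight drawn from a Beta distribution."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing weight in (0, 1)
    return lam * x1 + (1.0 - lam) * x2

# Example: mix two 1-second waveforms recorded at 48 kHz
x1 = np.random.randn(48000)
x2 = np.random.randn(48000)
x = mixup(x1, x2)
```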
2 Dataset overview
The ToyADMOS dataset consists of three sub-datasets for three types of ADMOS tasks. A different toy is used for each task. The name and overview of each sub-dataset are as follows:
- Toy car: a sub-dataset for the product-inspection task.
- Toy conveyor: a sub-dataset for the fault-diagnosis task for a fixed machine.
- Toy train: a sub-dataset for the fault-diagnosis task for a moving machine.
To collect various normal/anomalous sounds depending on individual differences, operating sounds are recorded using three or four models for each type of toy; these models belong to the same class of toys but have different detailed structures. In the ToyADMOS dataset, we use “case” as the identifier of each machine. The details are given in Sec 3.
Each sub-dataset consists of three types of sound data: normal, anomalous, and environmental. Their definitions are as follows:
- Normal sound:
Operating sound when the target machine operates normally in accordance with its specifications.
- Anomalous sound:
Operating sound when the target machine is made to operate anomalously by deliberately damaging its components or adding extraneous objects.
- Environmental noise:
Environmental noise for simulating a factory environment. Noise samples, such as collision, drilling, pumping, and airbrushing sounds, were collected at several locations in an actual factory. These sounds were emitted from four loudspeakers at the corners of each recording room.
Four omnidirectional microphones (SHURE SM11-CN) were used to collect these sounds. All sounds were stored as multiple wav-files categorized into two types: individual (IND) and continuous (CNT). The differences between IND and CNT are shown in Fig. 2. An IND wav-file contains the operating sound of an entire operation (i.e., from starting the toy to stopping it) in a single wav-file, and each wav-file is approximately 10 sec long. A CNT wav-file contains only part of the operating sound; the operating sound is recorded continuously and cut every 10 min. Normal sounds consist of both IND- and CNT-files, anomalous sounds consist of IND-files, and environmental noise consists of CNT-files. IND and CNT are assumed to be used mainly for evaluation and training, respectively, because of the difficulty in collecting IND data in real environments. CNT data can be collected just by recording the operating sound of a working machine, whereas IND-type data collection requires the machine to be started and stopped many times. Thus, IND data collection has a much higher cost than CNT data collection, so real-world systems are often trained with CNT datasets.
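As a rough illustration of how CNT recordings might be prepared for training, the following splits a long continuous recording into fixed-length segments (the helper name and segment length are illustrative, not part of the dataset specification):

```python
import numpy as np

def split_cnt(waveform, fs, segment_sec=10.0):
    """Split a continuous (CNT) recording into non-overlapping fixed-length segments."""
    seg_len = int(segment_sec * fs)
    n_segments = len(waveform) // seg_len  # drop the incomplete tail
    return [waveform[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

# A 10-min CNT file at 48 kHz yields 60 segments of 10 sec each
fs = 48000
cnt = np.zeros(10 * 60 * fs)
segments = split_cnt(cnt, fs)
# len(segments) == 60
```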
The main advantage of the ToyADMOS dataset over other datasets [19, 20] is that it was built under controlled conditions. An unsupervised approach is often adopted for ADMOS systems [10, 11, 12, 13, 14, 15] because it is difficult to build an extensive set of anomalous sounds in the real world. Therefore, a DNN is trained by using only given normal sound and anomalous sound is defined as “unknown” sounds, in contrast to supervised DCASE challenge tasks [21, 22] for detecting “defined” anomalous sounds such as gunshots . This definition results in misdetection caused by both a rare normal sound and the difference between the recording condition in training/test dataset. Thus, to analyze system performance and/or the cause of misdetection, all normal sounds in dataset need to be collected under the same condition, like as the ToyADMOS dataset.
The limitation of the ToyADMOS dataset is that toy sounds and real machine sounds do not necessarily match exactly. One determining factor of a machine sound is the size of the machine. Therefore, the details of the spectral shapes of toy and real machine sounds often differ, even though their time-frequency structures are similar. Thus, pre-processing parameters evaluated with the ToyADMOS dataset, such as filterbank parameters, need to be reconsidered before use in a real-world ADMOS system.
3 Details of sub-datasets
| | Toy car | Toy conveyor | Toy train |
|---|---|---|---|
| # of IND normal sounds per case and channel | 1,350 samples | 1,800 samples | 1,350 samples |
| Total hours (IND normal) | 66 hours | 60 hours | 66 hours |
| # of CNT normal sounds per case and channel | 150 samples | at least 124 samples | 74 samples |
| Total hours (CNT normal) | 135 hours | 120 hours | 197 hours |
| # of IND anomalous sounds per case and channel | 250 samples | 355 samples | 270 samples |
| # of CNT environmental noise samples per case and channel | 72 samples | 72 samples | 72 samples |
| Total hours (environmental noise) | 48 hours | 144 hours | 96 hours |
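The IND totals in the table can be cross-checked from the per-file lengths and the case/channel counts given in Sec 3 (11-sec files for the toy car and train, 10-sec files for the toy conveyor; four cases each for the car and train, three for the conveyor; four channels throughout):

```python
def ind_total_hours(n_samples, sec_per_file, n_cases, n_channels):
    """Total IND hours = samples per case/channel x file length x cases x channels."""
    return n_samples * sec_per_file * n_cases * n_channels / 3600.0

toy_car = ind_total_hours(1350, 11, 4, 4)       # 66.0 hours
toy_conveyor = ind_total_hours(1800, 10, 3, 4)  # 60.0 hours
toy_train = ind_total_hours(1350, 11, 4, 4)     # 66.0 hours
```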
| Toy car | | Toy conveyor | | Toy train | |
|---|---|---|---|---|---|
| Shaft | Bent | Tension pulley | Excessive tension | First carriage | Chipped wheel axle |
| Gears | Deformed, Melted | Tail pulley | Excessive tension, Removed | Last carriage | Chipped wheel axle |
| Tires | Coiled (plastic ribbon), Coiled (steel ribbon) | Belt | Attached metallic object 1, 2, 3 | Straight railway track | Broken, Obstructing stone, Disjointed |
| Voltage | Over voltage, Under voltage | Voltage | Over voltage, Under voltage | Curved railway track | Broken, Obstructing stone |
3.1 Toy-car sub-dataset
We assumed a product-inspection task and took up the task of detecting anomalous sounds from the running sound of a toy car on an inspection device, as shown in Fig. 1 (b). A toy car called a “mini 4WD,” whose four tires are driven by a small motor through gears and a shaft, was used as a miniature car machine, as shown in Fig. 1 (a). The motor was connected to a stabilized power supply, and the running sounds on the inspection device were recorded with four microphones. The inspection device, microphones, and loudspeakers in the recording room were arranged as shown in Fig. 3 (a).
Each “case” of the toy car was designed as a combination of two types of motors and two types of bearings; thus, the number of cases was four. Each wav-file of IND normal and anomalous sounds was 11 sec long, and 1,350 IND samples were recorded for each case and channel. The total number of hours of IND normal sounds is 66. Approximately 150 CNT samples were recorded for each case and channel. Note that, to reduce the motor load, a 10-min break was given after every 10 min of motor operation. Thus, the total number of hours of CNT normal sounds is 135, which is half the total length of the CNT-files. Anomalous sounds were generated by deliberately damaging the shaft, gears, and tires and by excessively lowering/raising the voltage, as shown in Table 2. In total, 250 samples of anomalous sounds were recorded from these combinations (53 patterns) for each case and channel. Since the microphones were positioned identically in all cases, 12 hours of environmental noise were recorded only once.
3.2 Toy-conveyor sub-dataset
We assumed a fault-diagnosis task for a fixed machine, in which anomalous sounds are detected from the operating sound of a toy conveyor fixed on a desk. We used a conveyor that transports a small tin toy car by driving a belt with a small built-in motor, as shown in Fig. 1 (c, upper). The channel 1 microphone was placed on the body of the conveyor, and the other microphones were placed on the desk, as shown in Fig. 1 (c, lower). The desk, microphones, and loudspeakers in the recording room were arranged as shown in Fig. 3 (b).
Three types of conveyors, which were produced by the same manufacturer but had different sizes, were used as the “cases” of the toy conveyor; thus, the number of cases was three. Each wav-file of IND normal and anomalous sounds was 10 sec long, and 1,800 IND samples were recorded for each case and channel. Thus, the total number of hours of IND normal sounds is 60. At least 124 CNT samples were recorded for each case, and the total number of hours of CNT normal sounds is 120. Anomalous sounds were generated by deliberately damaging the tension pulley, tail pulley, and belt and by excessively lowering/raising the voltage, as shown in Table 2. In total, 355 samples of anomalous sounds were recorded from these combinations (60 patterns). Since the first microphone was placed on the conveyor, the microphones were positioned differently in each case. Thus, 12 hours of environmental noise were recorded for each case.
3.3 Toy-train sub-dataset
We assumed a fault-diagnosis task for a moving machine, which detects anomalous sounds from the running sound of a toy train; the observations of the four channels thus need to be combined to detect anomalous sounds. We used HO-scale (large) and N-scale (small) model railways, which are precisely detailed miniature models of railways, as shown in Fig. 1 (d). Sound data were collected with four microphones surrounding the railway track. The microphones and loudspeakers in the recording room were arranged as shown in Fig. 3 (c). Note that since the sizes of the HO- and N-scale railways differed, the positions of the microphones also differed. The microphone arrangements shown in Figs. 1 (d) and 3 (c) are for the HO-scale railway. We removed the HO-scale railway and moved the microphones close to the N-scale railway when recording N-scale machine sounds.
Each “case” of the toy train was designed as a combination of two types of trains (commuter and bullet) and two scales (HO and N); thus, the number of cases was four. Each wav-file of IND normal and anomalous sounds was 11 sec long, and 1,350 IND samples were recorded for each case and channel. The total number of hours of IND normal sounds is 66. Seventy-four CNT samples were recorded for each case and channel; thus, the total number of hours of CNT normal sounds is 197. Anomalous sounds were generated by deliberately damaging the first/last carriage and the straight/curved railway track, as shown in Table 2. In total, 270 samples of anomalous sounds were recorded from these combinations (54 patterns). Since the microphones were positioned differently in the HO- and N-scale cases, 12 hours of environmental noise were recorded for each of the two arrangements.
4 Evaluation and benchmark
To illustrate usage and give a sense of the usefulness of the ToyADMOS dataset, we tested a simple baseline system on the three sub-datasets. The Python scripts for training, testing, and generating the training/test data are available for download at the same address as the ToyADMOS dataset.
We tested a simple unsupervised-ADS task using case 1, channel 1 of each sub-dataset. Note that, to simplify the experiment, we mixed the observations of channels 1–4 in the toy-train sub-dataset and used them as a single-channel observation. Randomly selected 1,000 IND normal sound samples were used for training, and the remaining IND normal and IND anomalous sound samples were used for evaluation. We built both training and test datasets by mixing in randomly cropped environmental noise. To control the signal-to-noise ratio, we multiplied the waveforms of the target sounds by 3.16 (+10 dB) in the toy-car and toy-conveyor sub-datasets and the waveforms of the noise in the toy-train sub-dataset. All sounds were downsampled to a sampling rate of 16 kHz.
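The factor 3.16 corresponds to a +10 dB amplitude change, since 20 log10(3.16) ≈ 10. A minimal sketch of the gain computation and mixing step (function names are illustrative; the released data-generation scripts may differ):

```python
import numpy as np

def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude gain."""
    return 10.0 ** (db / 20.0)

def mix_with_noise(target, noise, target_gain_db=0.0, noise_gain_db=0.0):
    """Scale target and/or noise before mixing; which signal is scaled
    depends on the sub-dataset, as described in the text."""
    return db_to_gain(target_gain_db) * target + db_to_gain(noise_gain_db) * noise

# db_to_gain(10.0) is approximately 3.162, the factor used in the text
gain = db_to_gain(10.0)
```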
We used a simple AE as the normal model, and its network architecture was almost the same as that used in . Each encoder/decoder of the AE has one input fully connected neural network (FCN) layer, four hidden FCN layers, and one output FCN layer. Each hidden layer has 512 hidden units, and the dimension of the encoder output is 128. The rectified linear unit is used after each FCN layer except the output layer of the decoder. The input vector was a 64-dimensional log-mel-amplitude spectrum, and the 10 time-frames before and after each frame were concatenated with it to account for previous and future frames. The reconstruction error of the AE was used as the anomaly score, and the parameters of the AE were trained to minimize the anomaly score of normal training samples. We fixed the learning rate for the initial 100 epochs and decreased it linearly between epochs 100 and 200. We always concluded the training after 200 epochs.
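The input construction described above concatenates each 64-dimensional log-mel frame with its 10 preceding and 10 following frames, giving a 64 × 21 = 1344-dimensional input vector. A minimal numpy sketch (the function name is illustrative, not from the released scripts):

```python
import numpy as np

def add_context(logmel, context=10):
    """Concatenate each frame with its `context` previous and future frames.

    logmel: (n_frames, n_mels) log-mel spectrogram.
    Returns: (n_frames - 2*context, n_mels * (2*context + 1)).
    """
    n_frames, n_mels = logmel.shape
    out = []
    for t in range(context, n_frames - context):
        # Flatten the window [t - context, t + context] into one vector
        out.append(logmel[t - context:t + context + 1].reshape(-1))
    return np.stack(out)

# 100 frames of 64-dim log-mel -> 80 context vectors of dimension 1344
X = add_context(np.random.randn(100, 64))
# X.shape == (80, 1344)
```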
We calculated anomaly scores for each time frame of all test wav-files. If the anomaly score exceeded the threshold for even one frame, the wav-file was determined to be anomalous. This system gave areas under the receiver operating characteristic curve of 0.874, 0.981, and 0.843 for the toy-car, toy-conveyor, and toy-train sub-datasets, respectively. By analyzing false-negative detections (i.e., overlooked anomalies), we found that the system frequently overlooked “over voltage” sounds in the toy-car sub-dataset and anomalous sounds of the “curved railway track” in the toy-train sub-dataset. The details of the spectral shapes of the normal and over-voltage sounds were slightly different; however, there was almost no change in the amplitude of these sounds. Moreover, the damaged point of the curved railway was far from all four microphones; therefore, the amplitude of the overlooked anomalous sounds was small. These results indicate one research direction in ADMOS using the ToyADMOS dataset: detecting anomalous sounds whose time-frequency structure does not change much from that of normal sounds.
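The file-level decision rule above is equivalent to scoring each file by its maximum frame-level anomaly score, and the area under the ROC curve can then be computed over the per-file scores. A minimal numpy sketch (not the released evaluation code):

```python
import numpy as np

def file_score(frame_scores):
    """A file is anomalous if any frame exceeds the threshold, so the
    file-level score is the maximum frame-level anomaly score."""
    return np.max(frame_scores)

def auc(scores_normal, scores_anomalous):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) statistic."""
    pairs = 0.0
    for a in scores_anomalous:
        pairs += np.sum(a > scores_normal) + 0.5 * np.sum(a == scores_normal)
    return pairs / (len(scores_anomalous) * len(scores_normal))

# Example: anomalous files mostly score higher than normal ones
normal = np.array([0.1, 0.2, 0.3])
anomalous = np.array([0.25, 0.9])
# auc(normal, anomalous) == 5/6: 0.25 outranks 2 of 3, 0.9 outranks all 3
```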
5 Conclusion
We introduced a new dataset called “ToyADMOS” designed for use in anomaly detection in machine operating sounds (ADMOS). To build a large-scale dataset for ADMOS use, we collected anomalous operating sounds of miniature machines (toys) by deliberately damaging them. The ToyADMOS dataset and some tutorial Python codes are freely available on the Web, and we hope this dataset can contribute to advancing research into anomaly detection in sounds.
Acknowledgements: The authors thank Minato Kanazawa at NTT TechnoCross Corporation for his technical assistance in the data collection process.
-  Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, “Unsupervised Detection of Anomalous Sound based on Deep Learning and the Neyman-Pearson Lemma,” IEEE/ACM Transactions on Audio Speech and Language Processing, pp.212–224, 2019.
-  S. Heinicke, A. K. Kalan, O. J. J. Wagner, R. Mundry, H. Lukashevich, and H. S. Kühl, “Assessing the Performance of a Semi-Automated Acoustic Monitoring System for Primates,” Methods in Ecology and Evolution, pp.753–763, 2015.
-  C. Clavel, T. Ehrette, and G. Richard, “Events Detection for an Audio-Based Surveillance System,” in Proc. of IEEE International Conference on Multimedia and Expo (ICME), pp.1306–1309, 2005.
-  G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, “Scream and Gunshot Detection and Localization for Audio-Surveillance Systems,” in Proc. of IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), pp.21–26, 2007.
-  S. Ntalampiras, I. Potamitis, and N. Fakotakis, “Probabilistic Novelty Detection for Acoustic Surveillance Under Real-World Conditions,” IEEE Transactions on Multimedia, pp.713–719, 2011.
-  P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento, “Audio Surveillance of Roads: A System for Detecting Anomalous Sounds,” IEEE Transactions on Intelligent Transportation Systems, pp.279–288, 2016.
-  A. Yamashita, T. Hara, and T. Kaneko, “Inspection of Visible and Invisible Features of Objects with Image and Sound Signal Processing,” in Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems, (IROS), pp.3837–3842, 2006.
-  Y. Koizumi, S. Saito, H. Uematsu, and N. Harada, “Optimizing Acoustic Feature Extractor for Anomalous Sound Detection Based on Neyman-Pearson Lemma,” in Proc. of European Signal Processing Conference (EUSIPCO), pp.698–702, 2017.
-  Y. Koizumi, S. Murata, N. Harada, S. Saito, and H. Uematsu, “SNIPER: Few-shot Learning for Anomaly Detection to Minimize False-Negative Rate with Ensured True-Positive Rate,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.915–919, 2019.
-  E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, “A Novel Approach for Automatic Acoustic Novelty Detection using a Denoising Autoencoder with Bidirectional LSTM Neural Networks,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.1996–2000, 2015.
-  Y. Kawaguchi and T. Endo, “How can we detect anomalies from subsampled audio signals?,” in Proc. of IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pp.1–6, 2017.
-  Y. Koizumi, S. Saito, M. Yamaguchi, S. Murata, and N. Harada, “Batch Uniformization for Minimizing Maximum Anomaly Score of DNN-based Anomaly Detection in Sounds,” in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019.
-  Y. Kawachi, Y. Koizumi, and N. Harada, “Complementary Set Variational Autoencoder for Supervised Anomaly Detection,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.2366–2370, 2018.
-  Y. Kawachi, Y. Koizumi, S. Murata, and N. Harada, “A Two-Class Hyper-Spherical Autoencoder for Supervised Anomaly Detection,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.3047–3051, 2019.
-  M. Yamaguchi, Y. Koizumi, and N. Harada, “AdaFlow: Domain-Adaptive Density Estimator with Application to Anomaly Detection and Unpaired Cross-Domain Transition,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.3647–3651, 2019.
-  J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.248–255, 2009.
-  J. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) Complete,” Linguistic Data Consortium, Philadelphia, 2007.
-  C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” University of Edinburgh. The Centre for Speech Technology Research (CSTR) http://dx.doi.org/10.7488/ds/1994, 2012.
-  J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: an ontology and human-labeled dataset for audio events,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.776–780, 2017.
-  E. Fonseca, J. Pons, X. Favory, F. Font, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter, and X. Serra, “Freesound datasets: a platform for the creation of open audio datasets,” in Proc. of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), pp.486–493, 2017.
-  A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline system,” in Proc. of Detection and Classification of Acoustic Scenes and Events challenge (DCASE), 2017.
-  A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley, “Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge,” IEEE/ACM Transactions on Audio Speech and Language Processing, pp.379–393, 2018.
-  H. Lim, J. Park, and Y. Han, “Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks,” in Tech. Report of Detection and Classification of Acoustic Scenes and Events challenge (DCASE Challenge), 2017.
-  E. Cakir and T. Virtanen, “Convolutional Recurrent Neural Networks for Rare Sound Event Detection,” in Tech. Report of Detection and Classification of Acoustic Scenes and Events challenge (DCASE Challenge), 2017.
-  Y. Kawaguchi, R. Tanabe, T. Endo, K. Ichige and K. Hamada, “Anomaly Detection based on an Ensemble of Dereverberation and Anomalous Sound Extraction,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.865–869, 2019.
-  S. J. Reddi, C. Oztireli, S. Kale, and S. Kumar, “On the Convergence of Adam and Beyond,” in Proc. of International Conference on Learning Representations, (ICLR), 2018.