Health Monitoring of Industrial Machines using Scene-Aware Threshold Selection

by Arshdeep Singh, et al.
IIT Mandi

This paper presents an autoencoder-based unsupervised approach to identify anomalies in an industrial machine using the sounds it produces. The proposed framework is trained on log-melspectrogram representations of the sound signal. For classification, our hypothesis is that the reconstruction error computed for an abnormal machine is larger than that for a normal machine, since only normal machine sounds are used to train the autoencoder. A threshold is chosen to discriminate between normal and abnormal machines. However, the appropriate threshold changes as the surrounding conditions vary. To select a suitable threshold irrespective of the surrounding, we propose a scene classification framework that identifies the underlying surrounding, so that the threshold can be selected adaptively. The experimental evaluation is performed on the MIMII dataset for four industrial machines, namely fan, pump, valve and slide rail. Our analysis shows that with the adaptive threshold, performance improves significantly over that obtained with a fixed threshold computed for a single surrounding only.






1 Introduction

Automated health monitoring of industrial machinery can help avoid unplanned downtime, increase productivity and reduce maintenance schedules. Acoustic monitoring of machinery offers advantages such as readily available sensors (microphones), non-intrusive sensing, and omnidirectional sensing. Automatic health monitoring can also be based on vibration sensors [3, 9, 7, 5], but microphones are non-intrusive, invariant to humidity and temperature conditions, and require none of the balancing techniques that vibration sensors use to overcome misalignment error [8].

A challenge in building an automatic health monitoring system is the availability of sufficient anomalous (we use the terms abnormal and anomalous interchangeably) samples [2]. Data-driven approaches for this purpose suffer from the unavailability of large-scale public datasets, although a few industrial machine datasets [16, 11] have recently been released publicly. Apart from this, it is difficult and costly to induce faults in a normally operating machine.

To overcome the data-scarcity problem, most studies employ an unsupervised approach that uses only normal machine data to identify anomalies. In this regard, the studies [16, 6, 4, 15, 12] employ sound signatures for anomaly detection in various applications. Sound signals have also been used successfully in many other areas, such as sound event detection, sound localization and scene classification [1, 17, 14, 13]. In this work, we aim to build an intelligent system that can discriminate between a normally and an abnormally operating machine using the sounds produced by the machine.

A typical anomaly identification framework first computes anomaly scores for the given machine. Next, an operating point, or threshold, is chosen to decide whether the score corresponds to a normal or an abnormal machine. However, identifying an appropriate threshold under varying noisy conditions is a challenging task. In Section 3.3, we show empirically that the appropriate threshold changes as the surrounding conditions change.

In this paper, an autoencoder-based model is utilized to monitor the health of a machine. We propose a threshold identification formulation based on the distribution of reconstruction errors obtained from the autoencoder on normal machine sounds. To overcome the variability of the threshold, we propose a convolutional neural network (CNN) based scene classification framework, which operates in parallel to the anomaly identification framework. The scene classification framework predicts the surrounding and accordingly chooses an appropriate threshold. Neither the scene classification model nor the autoencoder uses anomalous machine data in training. Therefore, the proposed framework is unsupervised and also possesses awareness of the surrounding. The key advantages and major contributions of this paper can be summarized as follows:

  • An unsupervised health monitoring framework is proposed using sounds produced by the industrial machine. Also, a threshold identification formulation is proposed to discriminate between normal and abnormal machine sounds.

  • A CNN-based scene-aware framework is proposed for adaptive selection of the threshold under varying surrounding conditions.

The rest of this paper is organized as follows. Section 2 describes the proposed methodology, Section 3 presents the performance evaluation, and Section 4 concludes the paper.

2 Proposed Methodology

In this section, first, we explain feature representation of a sound signal. Next, anomaly identification framework using an autoencoder (AE) is described. Subsequently, a scene classification framework for appropriate selection of threshold is presented in detail.

2.1 Feature representation of a sound signal

A given audio recording is converted into a spectrogram using the short-time Fourier transform (STFT) at a sampling frequency of 16kHz with a 50% overlapping window. Next, log-melspectrogram representations are obtained with 64 mel bands, followed by a logarithmic transformation of the mel energies. A window with a context size of 5 frames is used to obtain contextual representations. Each contextual representation, a 320-dimensional vector (320 = 64 × 5), is used as a training instance for the autoencoder model.
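The context-stacking step can be sketched in numpy as follows. The log-mel matrix is assumed precomputed here (an STFT plus mel filterbank front end would produce it), and the 313-frame size matches the 10s clips described in Section 3.2:

```python
import numpy as np

def context_windows(logmel, context=5):
    """logmel: (n_mels, n_frames) -> (n_frames - context + 1, n_mels * context)."""
    n_mels, n_frames = logmel.shape
    out = [logmel[:, t:t + context].flatten()
           for t in range(n_frames - context + 1)]
    return np.stack(out)

logmel = np.random.randn(64, 313)   # log-mel of one 10 s clip (Section 3.2)
feats = context_windows(logmel)
print(feats.shape)                  # (309, 320)
```

With a unit stride, a 313-frame clip yields 309 contextual frames, each a 320-dimensional training instance, matching the counts given in Section 3.2.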

2.2 AE-based anomaly identification

The AE model comprises fully-connected layers with 320-64-32-32-64-320 units. The ReLU activation function is used in all layers. The AE model has approximately 47k trainable parameters in total. The input to the AE is the 320-dimensional contextual representation explained previously. The objective of the AE model is to minimize the reconstruction error between training instances and their reconstructions, corresponding to normal machine sounds, as given in Equation 1:

E(x) = (1/D) ||x − x̂||²,   (1)

where E(x) is the reconstruction error for the contextual representation x, x̂ is the reconstruction of x predicted by the AE model, and D = 320 is the size of x.
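A minimal numpy sketch of this topology and of the per-instance reconstruction error, taken here as the mean squared error over the D = 320 dimensions (matching the loss named in Section 3.2). The weights below are random; training would minimize E(x) on normal machine sounds:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [320, 64, 32, 32, 64, 320]
layers = [(rng.standard_normal((m, n)) * 0.05, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)   # ReLU in every layer, as in the text
    return x

def recon_error(x):
    return float(np.mean((x - forward(x)) ** 2))   # Equation 1

n_params = sum(W.size + b.size for W, b in layers)
print(n_params)   # 46592, i.e. approx. 47k trainable parameters
```

Counting weights and biases per layer confirms the stated model size: the topology has 46,592 parameters, i.e. the "approx. 47k" quoted above.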


Decision making during testing: For a test audio, the contextual representations are obtained as explained in Subsection 2.1. Next, the reconstruction error given in Equation 1 is computed for each contextual frame using the trained AE model. Finally, the total reconstruction error for the test example is computed by averaging the reconstruction errors of all contextual frames. Since the AE model is trained using only normal machine sounds, our hypothesis is that the total reconstruction error for a normal machine is lower than that for an anomalous machine. A threshold is chosen to discriminate between the normal and abnormal classes.
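The test-time decision rule described above can be sketched as follows; the threshold value used here is a placeholder, with its computation described next in the text:

```python
import numpy as np

def classify_clip(frame_errors, threshold):
    """Average the per-contextual-frame errors of one clip, then threshold."""
    total_error = float(np.mean(frame_errors))
    label = "abnormal" if total_error > threshold else "normal"
    return label, total_error

label, score = classify_clip(np.array([0.10, 0.12, 0.11]), threshold=0.2)
print(label)   # normal
```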

Computation of threshold: In this work, we propose to compute the threshold θ as given in Equation 2.

Figure 1: Overall evaluation framework for health monitoring of an industrial machine. Here, s denotes the scene class or surrounding predicted by the S-Net, and θ_s denotes the threshold obtained for surrounding s using Equation 2.

θ = μ(R_t) + β · σ(R_t),   (2)

β = min(1, μ(R_v) / μ(R_t)),   (3)

where μ(·) and σ(·) denote the mean and standard deviation of a set of data respectively, and R_v and R_t are sets whose elements are the total reconstruction errors of each example in the validation and training datasets respectively. The scaling factor β, as given in Equation 3, controls the deviation around the mean reconstruction error. The proposed threshold θ represents the average behaviour of the reconstruction error for normal machine sounds. We include at most a unit deviation margin from the average reconstruction error to account for possible over-fitting of the autoencoder on the training data; the deviation is further controlled by β, depending on the ratio of the average reconstruction errors of the validation and training datasets.

The threshold θ can vary under different noisy conditions, and using a fixed θ computed in one surrounding can degrade performance severely. The threshold under a different noisy condition, denoted θ_s (where s denotes the given surrounding condition), can be computed as in Equation 2, but using training and validation samples collected in that surrounding.
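One plausible reading of this threshold formulation, the mean training reconstruction error plus an at-most-one-standard-deviation margin scaled by the ratio of validation to training mean errors, can be sketched as follows; the exact form of Equations 2 and 3 is an assumption reconstructed from the prose:

```python
import numpy as np

def scene_threshold(train_errors, val_errors):
    """theta = mu(train) + beta * sigma(train), beta = min(1, mu(val)/mu(train))."""
    mu_t = float(np.mean(train_errors))
    sigma_t = float(np.std(train_errors))
    beta = min(1.0, float(np.mean(val_errors)) / mu_t)   # assumed Equation 3
    return mu_t + beta * sigma_t                         # assumed Equation 2

train = np.array([0.10, 0.12, 0.11, 0.09])   # per-example errors (training)
val = np.array([0.13, 0.14, 0.12])           # per-example errors (validation)
theta = scene_threshold(train, val)
print(round(theta, 4))   # 0.1162 (beta is clipped to 1 in this example)
```

Computing θ_s per surrounding simply means calling this with the error sets collected under that surrounding.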

2.3 Adaptive selection of threshold

To select an appropriate θ_s under varying surrounding conditions, we propose a 1D-CNN based scene (surrounding) classification model, referred to as S-Net. The S-Net classifies the surrounding into three noise levels, s ∈ {more noisy, noisy, less noisy}, and can thus be used to select the threshold based on the surrounding noise conditions. In this work, we fix the surrounding to these three noise levels only; in the future, more noise levels can be used.

The S-Net comprises four layers. The first is a convolution layer with 16 filters, each of length 64, followed by a global average pooling layer, a dense layer with 64 units and a classification layer with 3 units. Every layer except the classification layer uses ReLU activation. The S-Net has approximately 2.4k parameters in total. The input to the S-Net is a 1-dimensional vector of length 2000 (250ms of audio sampled at 8kHz, as described in Section 3.2). The S-Net is trained using raw segments of audio signals corresponding to the various noise levels. An audio recording is divided into smaller non-overlapping segments, and each segment is considered a training instance. During testing, the probability scores obtained for each test segment of a given audio are aggregated, and the output unit with the maximum aggregated score is chosen as the scene class. Using the scene information for the input audio, the adaptive threshold θ_s can be selected accordingly. The overall proposed evaluation framework is shown in Figure 1.
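The stated topology and the test-time aggregation can be sketched as follows. The parameter count below simply tallies weights and biases of the named layers, and the aggregation sums per-segment softmax scores before taking the argmax:

```python
import numpy as np

# Parameter count of the stated topology (weights + biases per layer).
conv_params  = 64 * 16 + 16    # 16 conv filters of length 64
dense_params = 16 * 64 + 64    # global-average-pooled 16 channels -> dense(64)
clf_params   = 64 * 3 + 3      # dense(64) -> 3 scene classes
n_params = conv_params + dense_params + clf_params
print(n_params)   # 2323, i.e. approx. 2.4k

def aggregate_scene(segment_probs):
    """segment_probs: (n_segments, 3) softmax scores -> winning scene index."""
    return int(np.argmax(segment_probs.sum(axis=0)))

probs = np.array([[0.2, 0.5, 0.3],    # scores for two test segments
                  [0.1, 0.7, 0.2]])
print(aggregate_scene(probs))         # 1
```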


3 Performance Evaluation

3.1 Datasets Used

We utilize the MIMII dataset [16] to evaluate the proposed framework for health monitoring of industrial machines. The dataset comprises normal and abnormal sounds from four industrial machines, namely (a) fan, (b) pump, (c) valve and (d) slide rail. Each machine type consists of multiple individual machine models, specified by a model identity (ID). For example, the fan machine has four IDs: ID_00, ID_02, ID_04 and ID_06. Each audio signal is 10s long and recorded at a 16kHz sampling rate using 8 microphones.

The dataset provides machine sounds at three different signal-to-noise ratios (SNRs). Real factory noise, recorded in multiple factories, is added to the original machine sounds to generate audio examples at three SNRs: -6dB, 0dB and 6dB. It is important to note that the same set of audio recordings is thus available publicly at three different SNRs.

3.2 Training and validation split

AE-model setup: The AE model is trained for each machine type and ID using the 6dB normal sounds of that machine type and ID. The training data consists of 300 audio examples (indexed from 1 to 300, as given in the dataset) of normal sounds at 6dB SNR. From the remaining normal examples, we randomly choose 300 audio examples from each SNR dataset (except for slide rail ID_04 and ID_06, where 100 examples are used) as a validation dataset for threshold computation. All other examples (referred to as the "evaluation dataset") are used to evaluate the proposed framework.

An audio signal of 10s length is sampled at 16kHz and converted into a single channel by averaging all channels. The log-melspectrogram corresponding to the audio has size (64 × 313). This gives a total of 309 contextual frames using a context window of 5 with unit stride, as explained in Section 2.1. Under these settings, the training dataset consists of 92,700 examples (92700 = 309 × 300), each of 320 dimensions. The AE model is trained using the Adam optimizer [10] for 5k epochs with mean square error as the loss function.

S-Net setup: S-Net is trained for each machine type and ID using normal sounds from the three scene classes, namely the -6dB, 0dB and 6dB SNR conditions. The training set consists of 300 audio examples from each SNR dataset. Note that the same examples used to train the AE model are also used to train S-Net. The S-Net is trained using the Adam optimizer for 100 epochs. The loss function is cross-entropy, and early stopping with a minimum-loss criterion is applied on a randomly selected 10% of the training examples. Each audio signal is downsampled to 8kHz and divided into non-overlapping 250ms segments, giving 40 segments per recording. The total training dataset across the three classes consists of 36k segments (36000 = 300 × 40 × 3), each of length 2000 samples.

Area under the curve (AUC), true positive rate (TPR) and false positive rate (FPR) metrics are used for evaluation. Here, positive indicates the abnormal class.
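With "positive" meaning abnormal, TPR and FPR at a given threshold reduce to simple fractions of the thresholded reconstruction errors, which can be sketched as:

```python
import numpy as np

def tpr_fpr(errors, labels, threshold):
    """labels: 1 = abnormal (positive), 0 = normal."""
    errors, labels = np.asarray(errors), np.asarray(labels)
    pred = errors > threshold            # predicted abnormal
    tpr = float(np.mean(pred[labels == 1]))
    fpr = float(np.mean(pred[labels == 0]))
    return tpr, fpr

errs = [0.10, 0.15, 0.30, 0.40]             # total reconstruction errors
labs = [0, 0, 1, 1]
print(tpr_fpr(errs, labs, threshold=0.2))   # (1.0, 0.0)
```

AUC is then the area under the TPR-versus-FPR curve traced by sweeping the threshold over all error values.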

Figure 2: Reconstruction error (normalized) plots for fan ID_06 (a) 6dB, (b) 0dB and (c) -6dB test examples. The red dotted line shows the threshold obtained for each of the SNR levels using Equation 2.

3.3 Performance Analysis

The reconstruction error obtained using the AE model on the evaluation dataset of fan ID_06 at different SNRs is shown in Figure 2. It can be observed that the reconstruction error for the abnormal examples is higher than that for the normal examples under all noisy conditions. This shows that the proposed AE model can be used to identify the health (normal or abnormal) of a machine.

Table 1 shows the AUC for different machine types and IDs at different SNRs. In most cases, the AUC is significantly greater than 0.5. As the noise level increases, the AUC decreases for all machines.

Machine type    AUC (6dB)   AUC (0dB)   AUC (-6dB)
Fan             0.92        0.83        0.65
Pump            0.86        0.82        0.73
Valve           0.75        0.68        0.53
Slide rail      0.93        0.89        0.74
Table 1: Area under the curve (AUC) for various machines, averaged across machine IDs, at various SNR levels.

Next, the effectiveness of the proposed threshold criterion in Equation 2 is measured by computing TPR and (1 − FPR) at randomly selected thresholds. The performance is shown in Figure 3 for various machines. For all machines, the performance obtained at the threshold selected using Equation 2 deviates by at most approximately 0.1 in absolute terms from that obtained at any other threshold.

Analysis of performance at different noise levels: The appropriate threshold computed for each SNR dataset is shown as a red dotted line in Figure 2. It can be observed that the threshold varies as the surrounding condition changes. Consequently, performance degrades drastically when a fixed threshold, computed under one surrounding condition, is used to evaluate performance under varying surroundings. Figure 4 compares TPR and FPR among the three evaluation cases (a)-(c), explained below, for various machines at different SNRs.

(a) Baseline: In this case, the performance is computed for a given SNR dataset using the threshold computed for that SNR dataset only.
(b) Scene-aware threshold: In this case, the threshold is computed for each SNR dataset independently and evaluation is performed by selecting appropriate threshold using S-Net framework as shown in Figure 1.
(c) Fixed threshold: This case is similar to the case (b), except that the threshold is computed using 6dB data and is used for evaluation across other SNR dataset.

Figure 3: TPR and (1 − FPR) obtained at different thresholds for fan, slide rail, pump and valve at 6dB. The various thresholds are obtained by varying the deviation margin around the mean reconstruction error; the proposed threshold, as given in Equation 2, is also marked.
Figure 4: Averaged TPR, FPR across various machine ID’s for evaluation dataset of (a) fan, (b) pump, (c) valve and (d) slide rail machines at different SNR for baseline, scene-aware threshold and fixed threshold evaluation cases.

The TPR and FPR obtained using the scene-aware threshold are, for most machines, similar to those of the baseline. For the fixed threshold, however, both TPR and FPR approach 1 across all machines as the SNR decreases, showing that the fixed threshold biases predictions towards the abnormal class. With the scene-aware threshold framework, in contrast, performance remains similar to the baseline. This demonstrates the effectiveness of the proposed scene-aware framework in selecting an appropriate threshold that copes with variations caused by the surrounding environment.

4 Conclusion

In this paper, we propose an unsupervised health monitoring framework that identifies, and adaptively selects, an appropriate threshold to detect anomalies using the sounds produced by an industrial machine. The proposed framework requires only normal machine sounds and the surrounding conditions, which are easy to collect. Therefore, the proposed approach can be readily deployed in a real factory to monitor the health of industrial machines. In the future, we aim to use more SNR levels to validate the effectiveness of the proposed framework.


  • [1] Y. Aytar, C. Vondrick, and A. Torralba (2016) Soundnet: learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pp. 892–900. Cited by: §1.
  • [2] L. Bull, K. Worden, G. Manson, and N. Dervilis (2018) Active learning for semi-supervised structural health monitoring. Journal of Sound and Vibration 437, pp. 373–388. Cited by: §1.
  • [3] W. Caesarendra and T. Tjahjowidodo (2017) A review of feature extraction methods in vibration-based condition monitoring and its application for degradation trend estimation of low-speed slew bearing. Machines 5 (4), pp. 21. Cited by: §1.
  • [4] D. Conte, P. Foggia, G. Percannella, A. Saggese, and M. Vento (2012) An ensemble of rejecting classifiers for anomaly detection of audio events. In 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, pp. 76–81. Cited by: §1.
  • [5] G. S. Galloway, V. M. Catterson, T. Fay, A. Robb, and C. Love (2016) Diagnosis of tidal turbine vibration data through deep neural networks. In EUROPEAN CONFERENCE OF THE PROGNOSTICS AND HEALTH MANAGEMENT SOCIETY, Cited by: §1.
  • [6] K. Hendrickx, W. Meert, Y. Mollet, J. Gyselinck, B. Cornelis, K. Gryllias, and J. Davis (2020) A general anomaly detection framework for fleet-based condition monitoring of machines. Mechanical Systems and Signal Processing 139, pp. 106585. Cited by: §1.
  • [7] T. Heyns, P. S. Heyns, and J. P. De Villiers (2012) Combining synchronous averaging with a Gaussian mixture model novelty detection scheme for vibration-based condition monitoring of a gearbox. Mechanical Systems and Signal Processing 32, pp. 200–215. Cited by: §1.
  • [8] J. Isavand, A. Kasaei, A. Peplow, B. Afzali, and E. Shirzadi (2020) Comparison of vibration and acoustic responses in a rotary machine balancing process. Applied Acoustics 164, pp. 107258. Cited by: §1.
  • [9] X. Jin and T. W. Chow (2013) Anomaly detection of cooling fan and fault classification of induction motor using mahalanobis–taguchi system. Expert Systems with Applications 40 (15), pp. 5787–5795. Cited by: §1.
  • [10] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
  • [11] Y. Koizumi, S. Saito, H. Uematsu, N. Harada, and K. Imoto (2019) ToyADMOS: a dataset of miniature-machine operating sounds for anomalous sound detection. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 313–317. Cited by: §1.
  • [12] Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada (2018) Unsupervised detection of anomalous sound based on deep learning and the Neyman–Pearson lemma. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (1), pp. 212–224. Cited by: §1.
  • [13] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen (2017) DCASE 2017 challenge setup: tasks, datasets and baseline system. In DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events, Cited by: §1.
  • [14] A. Mesaros, T. Heittola, and T. Virtanen (2016) TUT database for acoustic scene classification and sound event detection. In Signal Processing Conference (EUSIPCO), 2016 24th European, pp. 1128–1132. Cited by: §1.
  • [15] Y. Ono, Y. Onishi, T. Koshinaka, S. Takata, and O. Hoshuyama (2013) Anomaly detection of motors with feature emphasis using only normal sounds. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2800–2804. Cited by: §1.
  • [16] H. Purohit, R. Tanabe, T. Ichige, T. Endo, Y. Nikaido, K. Suefusa, and Y. Kawaguchi (2019-10) MIMII dataset: sound dataset for malfunctioning industrial machine investigation and inspection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York University, NY, USA, pp. 209–213. Cited by: §1, §1, §3.1.
  • [17] J. Salamon and J. P. Bello (2017) Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters 24 (3), pp. 279–283. Cited by: §1.