Source codes for the paper "Deep Learning Algorithms for Rotating Machinery Intelligent Diagnosis: An Open Source Benchmark Study"
With the development of artificial intelligence and deep learning (DL) techniques, rotating machinery intelligent diagnosis has gone through tremendous progress with verified success and the classification accuracies of many DL-based intelligent diagnosis algorithms are tending to 100%. However, different datasets, configurations, and hyper-parameters are often recommended to be used in performance verification for different types of models, and few open source codes are made public for evaluation and comparisons. Therefore, unfair comparisons and ineffective improvement may exist in rotating machinery intelligent diagnosis, which limits the advancement of this field. To address these issues, we perform an extensive evaluation of four kinds of models with various datasets to provide a benchmark study within the same framework. In this paper, we first gather most of the publicly available datasets and give the complete benchmark study of DL-based intelligent algorithms under two data split strategies, five input formats, three normalization methods, and four augmentation methods. Second, we integrate the whole evaluation codes into a code library and release this code library to the public for better development of this field. Third, we use the specific-designed cases to point out the existing issues, including class imbalance, generalization ability, interpretability, few-shot learning, and model selection. By these works, we release a unified code framework for comparing and testing models fairly and quickly, emphasize the importance of open source codes, provide the baseline accuracy (a lower bound) to avoid useless improvement, and discuss potential future directions in this field. The code library is available at <https://github.com/ZhaoZhibin/DL-based-Intelligent-Diagnosis-Benchmark>.
Prognostics and health management (PHM) is one of the most essential systems in modern industrial equipment, such as helicopters, aero-engines, wind turbines, and high-speed trains. The main function of PHM systems used in rotating machinery is intelligent fault diagnosis for condition-based maintenance. Intelligent fault diagnosis is the key component of PHM systems and has been studied widely. Traditional intelligent diagnosis methods mainly consist of feature extraction using various signal processing methods and fault classification using various machine learning techniques. Although advanced signal processing methods (fast Fourier transform (FFT), spectrum kurtosis (SK), wavelet transform (WT), sparse representation, etc.) and machine learning algorithms (k-nearest neighbor (KNN), artificial neural network (ANN), support vector machine (SVM), etc.) have been successfully applied to intelligent diagnosis and have made considerable progress, performing diagnosis precisely and efficiently remains a challenging problem. With the development of online condition monitoring and data analysis systems, more and more kinds of real-time data are transferred from operating machines, and massive data are accumulated in the cloud. Faced with these heterogeneous massive data, feature extraction methods and signal-to-condition mappings that are designed and chosen by experts depend to a great extent on prior knowledge, and are therefore time-consuming and empirical.
Deep learning (DL), as a booming data mining technique, has swept many fields including computer vision (CV) (23; 10), natural language processing (NLP) (18; 50; 59), etc. In 2006, the concept of DL was first introduced through the proposal of the deep belief network (DBN) (17). In 2013, MIT Technology Review ranked DL among its top ten breakthrough technologies (36). In 2015, a review (24) published in Nature stated that DL allows computational models composed of multiple processing layers to learn data representations with multiple levels of abstraction. Due to its strong representation learning ability, DL is well-suited to data analysis. Therefore, in the field of intelligent fault diagnosis, many researchers have applied DL-based techniques, such as multi-layer perceptron (MLP), auto-encoder (AE), convolutional neural network (CNN), deep belief network (DBN), and recurrent neural network (RNN), to various diagnosis tasks. A large number of DL-based intelligent diagnosis algorithms have been proposed in recent years and their classification accuracies have been tending to 100%. However, when different researchers design DL-based intelligent diagnosis algorithms, they often recommend different inputs (like time domain input, frequency domain input, time-frequency domain input, wavelet domain input, slicing image input, etc.) and set different hyper-parameters (like the dimension of the input, the learning rate, the batch size, the network architecture, etc.). In addition, few authors make their codes available for evaluation and comparison, and it is difficult for others to reproduce the results completely and correctly. Therefore, unfair comparisons and ineffective improvement may exist in this field. Considering that this field lacks open source codes and benchmark studies, it is crucial to evaluate and compare different DL-based intelligent diagnosis algorithms to provide the benchmark or the lower bound of their accuracies and performance, thereby helping further studies in this field to develop more persuasive and appropriate algorithms.
For comprehensive performance comparisons and evaluation, it is important to gather different kinds of datasets. Actually, there exist several datasets for intelligent fault diagnosis. However, not every dataset provides a detailed description and is suited for the fault classification. For some datasets, the category discrimination is relatively large, and even one simple classifier can achieve acceptable results. Therefore, to thoroughly perform data mining and assess the difficulty of datasets, it is necessary to collect different datasets in a library and evaluate the performance of algorithms for different datasets on a unified platform.
In addition, one common issue in intelligent fault diagnosis concerns data splitting: researchers often use the random split strategy. This strategy is dangerous because, if the preparation process introduces any overlap between samples, the evaluation of classification algorithms will suffer from test leakage (42). Industrial data are rarely random and are almost always sequential (they might contain trends in the time domain). Therefore, it is more appropriate to split data according to time sequences (we simply call it order split) (42). Order split is also closer to reality, because in industry we always use historical data to predict the future condition. Conversely, if we randomly split the data, the diagnosis algorithms might record future patterns, which is another pitfall that causes test leakage.
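The difference between the two strategies can be illustrated with a minimal Python sketch. It assumes a single signal cut into overlapping windows; the window length, stride, 80/20 ratio, and all function names are illustrative assumptions, not the settings or API of the code library. The sketch counts how many test windows share raw data points with at least one training window.

```python
import random

def make_windows(n_points, length, stride):
    """Return (start, end) index pairs of sliding windows over one signal."""
    return [(s, s + length) for s in range(0, n_points - length + 1, stride)]

def order_split(windows, train_ratio=0.8):
    """Earlier windows go to training, later windows go to testing."""
    cut = int(len(windows) * train_ratio)
    return windows[:cut], windows[cut:]

def random_split(windows, train_ratio=0.8, seed=0):
    """Shuffle the windows before cutting, ignoring their time order."""
    shuffled = windows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def leaked_test_windows(train, test):
    """Count test windows that overlap at least one training window."""
    return sum(
        any(ts < e and s < te for (s, e) in train)
        for (ts, te) in test
    )

# Hypothetical settings: neighbouring windows overlap by 50%.
windows = make_windows(n_points=20480, length=1024, stride=512)
for name, split in (("order", order_split), ("random", random_split)):
    train, test = split(windows)
    print(name, len(train), len(test), leaked_test_windows(train, test))
```

Under these assumptions, the order split can only leak at the single boundary between the training and testing segments (which could be removed by dropping a gap of one window), whereas the random split scatters test windows among overlapping training neighbours.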
In this paper, we first collect most of the publicly available datasets and discuss whether it is suitable for intelligent fault diagnosis. Second, we release a code library of the data preparation for all datasets which are suitable for fault classification and the whole evaluation framework with different input formats, normalization methods, data split ways, augmentation methods, and DL-based models. Meanwhile, we also use some datasets to discuss the existing issues in intelligent fault diagnosis including class imbalance, generalization ability, interpretability, few-shot learning, and model selection. To the best of our knowledge, this is the first work to comprehensively perform the benchmark study and release the code library of DL-based intelligent algorithms. In summary, this work mainly focuses on evaluating various DL-based intelligent diagnosis algorithms for most of the publicly available datasets from several perspectives, providing the benchmark accuracy (it is worth mentioning that the results are just a lower bound of accuracy) to avoid useless improvement, and releasing the code library for complete evaluation procedures. Through these works, we hope to make comparing and testing models fairer and quicker, emphasize the importance of open source codes and the benchmark study in this field, and provide some suggestions and discussions of future studies.
The contributions of this paper are listed as follows:
Various datasets and data preparing. We gather most of the publicly available datasets and give the detailed discussion about its adaptability to DL-based intelligent diagnosis. For data preparing, we first discuss different kinds of input formats and different normalization methods for listed datasets. After that, we state that data augmentation which is a common step in CV and NLP might be important to make the training datasets more diverse, and we also try some kinds of data augmentation methods to clarify that they have not been fully investigated. Meanwhile, we also discuss the way of data split and state that it may be more appropriate to split data according to time sequences (also called order split).
Benchmark accuracy and further studies. We evaluate various DL-based intelligent diagnosis algorithms including MLP, AE, CNN, and RNN for different datasets and provide the benchmark accuracy to make the future studies in this field more comparable and meaningful. We also use the experimental examples to discuss the existing problems in intelligent fault diagnosis including class imbalance, generalization ability, interpretability, few-shot learning, and model selection problems.
Open source codes. For enhancing the importance and reproducibility of DL-based intelligent diagnosis algorithms, we release the whole evaluation codes in a code library for the better development of this field. At the same time, this is a unified intelligent fault diagnosis library, which retains an extended interface for everyone to load their own datasets and models by themselves to carry out new studies. The code library is available at https://github.com/ZhaoZhibin/DL-based-Intelligent-Diagnosis-Benchmark.
The outline of the paper is as follows. In Section 2, we give a brief review of the recent development of DL-based intelligent diagnosis algorithms. Then, Sections 3 to 9 discuss the evaluation algorithms, datasets, data preprocessing, data augmentation, data split, evaluation methodologies, and evaluation results, respectively. After that, Section 10 presents some further discussions and the corresponding results, followed by conclusions in Section 11.
Recently, DL has become a promising method in a large scope of fields, and a huge amount of papers related to DL have been published since 2012. This paper mainly focuses on a benchmark study of intelligent fault diagnosis, rather than providing a comprehensive review on DL for other fields. Some famous DL researchers have published more professional references and interested readers can refer to (24; 11).
In the field of intelligent fault diagnosis, due to the efforts of many researchers in recent years, DL has become one of the most popular data-driven methods to perform fault diagnosis and health monitoring. In general, DL-based methods can extract representative features adaptively without any manual intervention and can achieve higher accuracy than traditional machine learning algorithms in most tasks when the dataset is large enough. We conducted a literature search in the Web of Science Core Collection database. As shown in Fig. 1, the number of published papers related to DL-based intelligent algorithms increases year by year.
Another interesting observation is that many review papers on this topic have been published in the recent four years. Therefore, in this paper, we only briefly review and introduce the main contents of different review papers to allow readers who just enter this field to find suitable review papers quickly.
In bearing fault diagnosis, Li et al. (29) provided a systematic review of fuzzy formalisms including combination with other machine learning algorithms. Hoang et al. (19) provided a comprehensive review of three popular DL algorithms (AE, DBN, and CNN) for bearing fault diagnosis. Zhang et al. (61) systematically reviewed the machine learning and DL-based algorithms for bearing fault diagnosis and also provided a comparison of the classification accuracy of CWRU with different DL-based methods. Hamadache et al. (13) reviewed different fault modes of rolling element bearings and described various health indexes for PHM. Meanwhile, it also provided a survey of artificial intelligence methods for PHM including shallow learning and deep learning.
AI-based approaches including KNN, SVM, ANN, Naive Bayes, and DL have also been reviewed for fault diagnosis of rotating machinery. Wei et al. (56) summarized early fault diagnosis of gears, bearings, and rotors through signal processing methods (adaptive decomposition methods, WT, and sparse decomposition) and AI-based methods (KNN, neural network, and SVM). Computational intelligence approaches including ANN, evolutionary algorithms, fuzzy logic, and SVM have likewise been reviewed for machinery fault diagnosis. Zhao et al. (65) reviewed data-driven machine health monitoring through DL methods (AE, DBN, CNN, and RNN) and provided the data and codes (in Keras) of an experimental study.
In addition, Nasiri et al. (37) surveyed the state-of-the-art AI-based approaches for fracture mechanics and provided the accuracy comparisons achieved by different machine learning algorithms for mechanical fault detection. Tian et al. (52) surveyed different modes of traction induction motor fault and their diagnosis algorithms including model-based methods and AI-based methods. Khan et al. (21) provided a comprehensive review of AI for system health management and emphasized the trend of DL-based methods with limitations and benefits. Stetco et al. (49) reviewed machine learning approaches applied to wind turbine condition monitoring and made a discussion of the possibility for the future research. Ellefsen et al. (8) reviewed four well-established DL algorithms including AE, CNN, DBN, and LSTM for PHM applications and discussed the chances and challenges for the future studies, especially in the field of PHM in autonomous ships. AI-based algorithms (traditional machine learning algorithms and DL-based approaches) and applications (smart sensors, intelligent manufacturing, PHM, and cyber-physical systems) were reviewed in (1; 6; 55; 47) for smart manufacturing and manufacturing diagnosis.
Although a large body of DL-based methods and many related reviews have been published in the field of intelligent fault diagnosis, few studies thoroughly evaluate various DL-based intelligent diagnosis algorithms for most of the publicly available datasets, provide the benchmark accuracy, and release the code library for complete evaluation procedures. For example, a simple code written in Keras was published in (65), which is not comprehensive enough for different datasets and models. The accuracy comparisons were provided in (61; 37) according to existing papers, but they were not comprehensive enough due to different configurations and test conditions. Therefore, this paper is intended to make up for this gap and emphasize the importance of open source codes and the benchmark study in this field.
A large number of DL-based intelligent diagnosis methods have been published in the field of fault diagnosis and prognosis. It is impossible to cover all the published models since there is currently no open source community in this field. Therefore, we instead test the performance of four categories of representative models (MLP, AE, CNN, and RNN) that embed some advanced techniques. It should be noted that DBN is another commonly used DL method for fault diagnosis, but we do not add it to this code library because the way DBN is trained differs substantially from that of those four categories.
Multilayer perceptron (MLP) (44), a fully connected network with one or more hidden layers, was proposed in 1987 as the prototype of the artificial neural network (ANN). With such a simple structure, MLP can complete some easy classification tasks such as MNIST. But as the task becomes more complex, MLP becomes hard to train because of its huge number of parameters. An MLP with five fully connected layers and five batch normalization layers is used in this paper for the one-dimensional (1D) input data. The structure and parameters of the model are shown in Fig. 2, where FC means the fully connected layer, BN means the batch normalization layer, and CE loss means the softmax cross-entropy loss.
Auto-encoder (AE) was first proposed in 2006 as a method for dimensionality reduction. It can reduce the dimensionality of the input data while retaining most of the information in the data. AE consists of an encoder and a decoder; the decoder tries to reconstruct the input from the output of the encoder, and the reconstruction error is used as the loss function. The encoder and decoder are trained to generate the low-dimensional representation of the input and to reconstruct the input from that representation, respectively. Subsequently, various derivatives of AE were proposed by researchers, such as variational auto-encoder (VAE) (22), denoising auto-encoder (DAE) (53), and sparse auto-encoder (SAE) (40). In this paper, we design the deep AE and its derivatives for 1D input data and two-dimensional (2D) input data, respectively. Considering different features of neural networks, their structures and hyper-parameters, shown in Fig. 3, change adaptively. Specifically, the network structures of DAE and SAE are the same as that of AE, and the differences lie in the loss function and the inputs. During the training of AE and its derivatives, the encoder and decoder are first trained jointly to obtain low-dimensional features of the data. After that, the encoder and classifier are trained jointly for the classification task. Besides, in Fig. 3, the MSE loss means the mean square error loss, Conv means the convolutional layer, and the KLP loss means the Kullback-Leibler divergence loss; Fig. 3 also contains the transposed convolutional (i.e., inverse convolution) layer.
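The two loss terms named above can be written out as follows. This is a sketch of the standard formulations, with the Kullback-Leibler term given in its common sparse auto-encoder form. It assumes an encoder $f$, a decoder $g$, an input $x \in \mathbb{R}^{n}$, a target sparsity $\rho$, and a mean activation $\hat{\rho}_j$ for each of $m$ hidden units; these symbols are introduced here for illustration only, and the exact form and weighting used in the code library may differ.

$$
\begin{aligned}
\hat{x} &= g\bigl(f(x)\bigr), \\
\mathcal{L}_{\mathrm{MSE}} &= \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \hat{x}_i\right)^2, \\
\mathcal{L}_{\mathrm{KL}} &= \sum_{j=1}^{m}\left[\rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right].
\end{aligned}
$$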
Convolutional neural network (CNN) (25) was first proposed in 1997, and the proposed network was also called LeNet. CNN is a specialized kind of neural network for processing data that have a known grid-like topology. Sparse interactions, parameter sharing, and equivariant representations are realized with convolution and pooling operations in CNN. In 2012, AlexNet (23) won the ImageNet competition, far surpassing the second place, and deep CNN has attracted wide attention since then. Besides, in 2016, ResNet (16) was proposed and its classification accuracy exceeded the human baseline. In this paper, we design a five-layer 1D CNN and a five-layer 2D CNN for 1D input data and 2D input data, respectively, and also adapt three well-known CNN models (LeNet, ResNet18, and AlexNet) for the two types of input data. The details are shown in Fig. 4, where MaxPool means the max pooling layer, AdaptiveMaxPool means the adaptive max pooling layer, and Dropout means the dropout layer.
Recurrent neural network (RNN) can describe temporal dynamic behavior and is very suitable for dealing with time series. However, RNN often suffers from vanishing and exploding gradients during training. To overcome these problems, the long short-term memory network (LSTM) was proposed in 1997 (20) for processing continual input streams and has achieved great success in various fields such as NLP. Bi-directional LSTM (BiLSTM) can capture bidirectional dependencies over long distances and learn to remember and forget information selectively. We utilize BiLSTM as the representative of RNN to deal with two types of input data (1D and 2D) for the classification task. The details of BiLSTM are shown in Fig. 5, where Transpose means transposing the channel and feature dimensions of the input data, and BiLSTM Block means the BiLSTM layer.
In the field of intelligent fault diagnosis, publicly available datasets have not been investigated in depth. For comprehensive performance comparisons and evaluation, it is important to gather different kinds of representative datasets. We collected nine commonly used datasets, all of which have specific labels and explanations except for the PHM 2012 bearing dataset and the IMS bearing dataset; PHM 2012 and IMS are therefore not suitable for fault classification, which requires labels. To sum up, this paper uses seven datasets to verify the performance of the models introduced in Section 3. All nine datasets are described as follows.
CWRU datasets were provided by the Case Western Reserve University Bearing Data Center (5). Vibration signals were collected at 12 kHz or 48 kHz for normal bearings and damaged bearings with single-point defects under four different motor loads. Within each working condition, single-point faults with diameters of 0.007, 0.014, and 0.021 inches were introduced on the rolling element, the inner ring, and the outer ring. In this paper, we use the data collected from the drive end at a sampling frequency of 12 kHz. In Table 1, one health state and three fault locations, including the inner ring fault, the rolling element fault, and the outer ring fault, are classified into ten categories (one health state and nine fault states) according to different fault sizes.
|Health State||the normal bearing at 1797 rpm and 0 HP|
|Inner ring 1||0.007 inch inner ring fault at 1797 rpm and 0 HP|
|Inner ring 2||0.014 inch inner ring fault at 1797 rpm and 0 HP|
|Inner ring 3||0.021 inch inner ring fault at 1797 rpm and 0 HP|
|Rolling Element 1||0.007 inch rolling element fault at 1797 rpm and 0 HP|
|Rolling Element 2||0.014 inch rolling element fault at 1797 rpm and 0 HP|
|Rolling Element 3||0.021 inch rolling element fault at 1797 rpm and 0 HP|
|Outer ring 1||0.007 inch outer ring fault at 1797 rpm and 0 HP|
|Outer ring 2||0.014 inch outer ring fault at 1797 rpm and 0 HP|
|Outer ring 3||0.021 inch outer ring fault at 1797 rpm and 0 HP|
MFPT datasets were provided by the Society for Machinery Failure Prevention Technology (48). MFPT datasets consisted of three bearing datasets, namely 1) a baseline dataset sampled at 97656 Hz for six seconds in each file; 2) seven outer ring fault datasets sampled at 48828 Hz for three seconds in each file; 3) seven inner ring fault datasets sampled at 48828 Hz for three seconds in each file; as well as 4) some other datasets which are not used in this paper (more detailed information can be found on the website of MFPT datasets (48)). In Table 2, one health state bearing and two fault bearings, including the outer ring fault and the inner ring fault, are classified into fifteen categories (one health state and fourteen fault states) according to different loads.
|Health State||Fault-free bearing working at 270 lbs|
|Outer ring 1||Outer ring fault bearing working at 25 lbs|
|Outer ring 2||Outer ring fault bearing working at 50 lbs|
|Outer ring 3||Outer ring fault bearing working at 100 lbs|
|Outer ring 4||Outer ring fault bearing working at 150 lbs|
|Outer ring 5||Outer ring fault bearing working at 200 lbs|
|Outer ring 6||Outer ring fault bearing working at 250 lbs|
|Outer ring 7||Outer ring fault bearing working at 300 lbs|
|Inner ring 1||Inner ring fault bearing working at 0 lbs|
|Inner ring 2||Inner ring fault bearing working at 50 lbs|
|Inner ring 3||Inner ring fault bearing working at 100 lbs|
|Inner ring 4||Inner ring fault bearing working at 150 lbs|
|Inner ring 5||Inner ring fault bearing working at 200 lbs|
|Inner ring 6||Inner ring fault bearing working at 250 lbs|
|Inner ring 7||Inner ring fault bearing working at 300 lbs|
PU datasets were provided by the Paderborn University Bearing Data Center (28; 27), and PU datasets consisted of 32 sets of bearing current signals and vibration signals. As shown in Table 3, bearings are divided into: 1) six undamaged bearings; 2) twelve artificially damaged bearings; 3) fourteen bearings with real damages caused by accelerated lifetime tests. Each dataset was collected under four working conditions as shown in Table 4.
|Bearing Code||Fault Mode||Description||Bearing Code||Fault Mode||Description|
|K001||Health state||Run-in 50 h before test||KI07||Artificial inner ring fault (Level 2)||Made by electric engraver|
|K002||Health state||Run-in 19 h before test||KI08||Artificial inner ring fault (Level 2)||Made by electric engraver|
|K003||Health state||Run-in 1 h before test||KA04||Outer ring damage (single point + S + Level 1)||Caused by fatigue and pitting|
|K004||Health state||Run-in 5 h before test||KA15||Outer ring damage (single point + S + Level 1)||Caused by plastic deform and indentation|
|K005||Health state||Run-in 10 h before test||KA16||Outer ring damage (single point + R + Level 2)||Caused by fatigue and pitting|
|K006||Health state||Run-in 16 h before test||KA22||Outer ring damage (single point + S + Level 1)||Caused by fatigue and pitting|
|KA01||Artificial outer ring fault (Level 1)||Made by EDM||KA30||Outer ring damage (distributed + R + Level 1)||Caused by plastic deform and indentation|
|KA03||Artificial outer ring fault (Level 2)||Made by electric engraver||KB23||Outer ring and inner ring damage (single point + M + Level 2)||Caused by fatigue and pitting|
|KA05||Artificial outer ring fault (Level 1)||Made by electric engraver||KB24||Outer ring and inner ring damage (distributed + M + Level 3)||Caused by fatigue and pitting|
|KA06||Artificial outer ring fault (Level 2)||Made by electric engraver||KB27||Outer ring and inner ring damage (distributed + M + Level 1)||Caused by plastic deform and indentation|
|KA07||Artificial outer ring fault (Level 1)||Made by drilling||KI04||Inner ring damage (single point + M + Level 1)||Caused by fatigue and pitting|
|KA08||Artificial outer ring fault (Level 2)||Made by drilling||KI14||Inner ring damage (single point + M + Level 1)||Caused by fatigue and pitting|
|KA09||Artificial outer ring fault (Level 2)||Made by drilling||KI16||Inner ring damage (single point + S + Level 3)||Caused by fatigue and pitting|
|KI01||Artificial inner ring fault (Level 1)||Made by EDM||KI17||Inner ring damage (single point + R + Level 1)||Caused by fatigue and pitting|
|KI03||Artificial inner ring fault (Level 1)||Made by electric engraver||KI18||Inner ring damage (single point + S + Level 2)||Caused by fatigue and pitting|
|KI05||Artificial inner ring fault (Level 1)||Made by electric engraver||KI21||Inner ring damage (single point + S + Level 1)||Caused by fatigue and pitting|
|No.||Rotating speed (rpm)||Load torque (Nm)||Radial force (N)||Name of setting|
In this paper, since using all the data would require a huge computational time, we only use the data collected from bearings with real damage (including KA04, KA15, KA16, KA22, KA30, KB23, KB24, KB27, KI14, KI16, KI17, KI18, and KI21) under the working condition N15_M07_F10 to carry out the performance verification. It is worth mentioning that since KI04 has exactly the same description as KI14 in Table 3, we delete KI04, and the total number of classes is thirteen. Besides, only vibration signals are used for testing the models.
UoC gear fault datasets were provided by the University of Connecticut (4), and UoC datasets were collected at 20 kHz. In this dataset, nine different gear conditions were introduced to the pinions on the input shaft, including healthy condition, missing tooth, root crack, spalling, and chipping tip with 5 different levels of severity. All the collected datasets are used and classified into nine categories (one health state and eight fault states) to test the performance.
XJTU-SY bearing datasets were provided by the Institute of Design Science and Basic Component at Xi’an Jiaotong University and the Changxing Sumyoung Technology Co. (57; 54). XJTU-SY datasets consisted of run-to-failure data of fifteen bearings under three different working conditions. Data were collected at 25.6 kHz. A total of 32768 data points were recorded for each sampling, and the sampling period is equal to one minute. The details of bearing lifetime and fault elements are shown in Table 5. In this paper, we use all the data described in Table 5, and the total number of classes is fifteen. It should be noted that we use the data collected at the end of the run-to-failure experiments.
|Bearing 1_1||2h 3min||Outer ring|
|Bearing 1_2||2h 41min||Outer ring|
|Bearing 1_3||2h 38min||Outer ring|
|Bearing 1_4||2h 2min||Cage|
|Bearing 1_5||52 min||Inner ring and Outer ring|
|Bearing 2_1||8h 11min||Inner ring|
|Bearing 2_2||2h 41min||Outer ring|
|Bearing 2_3||8h 53min||Cage|
|Bearing 2_4||42min||Outer ring|
|Bearing 2_5||5h 39min||Outer ring|
|Bearing 3_1||42h 18min||Outer ring|
|Bearing 3_2||41h 36min||Inner ring, Rolling element, Cage, and Outer ring|
|Bearing 3_3||6h 11min||Inner ring|
|Bearing 3_4||25h 15min||Inner ring|
|Bearing 3_5||1h 54min||Outer ring|
SEU gearbox datasets were provided by Southeast University (45; 46). SEU datasets contained two sub-datasets, a bearing dataset and a gear dataset, which were both acquired on the Drivetrain Dynamics Simulator (DDS). There are two kinds of working conditions, with the rotating speed - load configuration (RS-LC) set to 20 Hz - 0 V and 30 Hz - 2 V, as shown in Table 6. The total number of classes is twenty according to Table 6 under different working conditions. Within each file, there are eight rows of vibration signals, and we use the second row of vibration signals.
|Fault Mode||RS-LC||Fault Mode||RS-LC|
|Health Gear||20 Hz - 0 V||Health Bearing||20 Hz - 0 V|
|Health Gear||30 Hz - 2 V||Health Bearing||30 Hz - 2 V|
|Chipped Tooth||20 Hz - 0 V||Inner ring||20 Hz - 0 V|
|Chipped Tooth||30 Hz - 2 V||Inner ring||30 Hz - 2 V|
|Missing Tooth||20 Hz - 0 V||Outer ring||20 Hz - 0 V|
|Missing Tooth||30 Hz - 2 V||Outer ring||30 Hz - 2 V|
|Root Fault||20 Hz - 0 V||Inner + Outer ring||20 Hz - 0 V|
|Root Fault||30 Hz - 2 V||Inner + Outer ring||30 Hz - 2 V|
|Surface Fault||20 Hz - 0 V||Rolling Element||20 Hz - 0 V|
|Surface Fault||30 Hz - 2 V||Rolling Element||30 Hz - 2 V|
JNU bearing datasets were provided by Jiangnan University (31; 30). JNU datasets consisted of three bearing vibration datasets with different rotating speeds, and the data were collected at 50 kHz. As shown in Table 7, JNU datasets contained one health state and three fault modes which include inner ring fault, outer ring fault, and rolling element fault. Therefore, the total number of classes is equal to twelve according to different working conditions.
|Fault Mode||Rotating Speed||Fault Mode||Rotating Speed||Fault Mode||Rotating Speed|
|Health State||600 rpm||Health State||800 rpm||Health State||1000 rpm|
|Inner ring||600 rpm||Inner ring||800 rpm||Inner ring||1000 rpm|
|Outer ring||600 rpm||Outer ring||800 rpm||Outer ring||1000 rpm|
|Rolling Element||600 rpm||Rolling Element||800 rpm||Rolling Element||1000 rpm|
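The class count above follows from pairing every fault mode with every rotating speed. A minimal Python sketch of such a label table is given below; the integer label order and the variable names are hypothetical and are not taken from the code library.

```python
from itertools import product

# Fault modes and rotating speeds as listed in Table 7.
fault_modes = ["Health State", "Inner ring", "Outer ring", "Rolling Element"]
speeds_rpm = [600, 800, 1000]

# One class per (rotating speed, fault mode) pair.
# The integer order is an illustrative assumption.
label_of = {
    pair: idx for idx, pair in enumerate(product(speeds_rpm, fault_modes))
}

print(len(label_of))  # 4 fault modes x 3 speeds = 12 classes
```

The same construction would apply to the SEU datasets, where ten fault modes are paired with two working conditions to give twenty classes.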
PHM 2012 bearing datasets were used for PHM IEEE 2012 Data Challenge (39; 38). In PHM 2012 datasets, seventeen run-to-failure datasets were provided including six training sets and eleven testing sets. Three different loads were considered. Vibration and temperature signals were gathered during all those experiments. Since no label on the types of failures was given, it is not used in this paper.
IMS bearing datasets were generated by the NSF I/UCR Center for Intelligent Maintenance Systems (26). IMS datasets were made up of three bearing datasets, and each of them contained vibration signals of four bearings installed on the different locations. At the end of the run-to-failure experiment, a defect occurred on one of the bearings. The failure occurred in the different locations of bearings. It is inappropriate to classify these failures simply using three classes, so IMS datasets are not evaluated in this paper.
The reason why DL is superior in fault classification lies in its excellent feature extraction ability and feature space transformation ability. Although it is an end-to-end learning method, the type of input data and the way of normalization have a great impact on its performance. The type of input data determines the difficulty of feature extraction, and the normalization method determines the difficulty of calculation. So, in this paper, effects of five different input types and three different normalization methods on the performance of DL models are discussed.
In the field of CV and NLP, commonly used input types consist of images and texts, while in intelligent fault diagnosis, what we collected directly is the time series. Therefore, many researchers use signal processing methods to map the time series to different domains to get a better input type. However, which input type is more suitable to the intelligent fault diagnosis is still an open question. In this paper, effects of different input types on model performance are discussed.
For the time domain input, vibration signals are directly used as the input without data preprocessing. In this paper, the length of each sample is 1024, and the total number of samples is obtained from Eq. 1. After generating samples, we take 80% of the samples as the training set and 20% as the testing set.

$$N = \mathrm{floor}\left(\frac{L}{1024}\right) \tag{1}$$

where $L$ is the length of each signal, $N$ is the total number of samples, and floor means rounding towards minus infinity.
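The non-overlapping segmentation described above can be sketched as follows. This is a minimal illustration rather than the authors' released code; the function name `segment_signal` and the synthetic signal are assumptions made for the example.

```python
import numpy as np

def segment_signal(signal, sample_len=1024):
    """Cut a 1D signal into non-overlapping samples of length sample_len."""
    n_samples = len(signal) // sample_len        # floor(L / sample_len)
    usable = signal[: n_samples * sample_len]    # drop the incomplete tail
    return usable.reshape(n_samples, sample_len)

# Synthetic stand-in for one vibration record (hypothetical data).
signal = np.arange(5000, dtype=float)
samples = segment_signal(signal)
print(samples.shape)  # 5000 points give 4 complete samples of 1024 points
```

Because the windows do not overlap, no point of the raw signal appears in more than one sample.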
For the frequency domain input, FFT is used to transform each sample from the time domain into the frequency domain, as shown in Eq. 2. After this operation, the length of the data is halved, and the new sample can be expressed as:

$$x^{\mathrm{FFT}} = \mathrm{FFT}(x) \tag{2}$$

where $x$ is a time domain sample and the operator $\mathrm{FFT}(\cdot)$ represents transforming $x$ into the frequency domain and taking the first half of the result.
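A minimal sketch of this frequency domain preprocessing is given below. The text only states that the first half of the FFT result is kept; taking the magnitude and dividing by the sample length are assumptions of this example, as is the function name `to_frequency_domain`.

```python
import numpy as np

def to_frequency_domain(sample):
    """Return the first half of the FFT of a 1D sample as amplitudes."""
    spectrum = np.fft.fft(sample)
    half = len(sample) // 2
    # Magnitude and 1/N scaling are assumptions, not stated in the text.
    return np.abs(spectrum[:half]) / len(sample)

# Synthetic test tone with exactly 64 cycles in 1024 points.
n = np.arange(1024)
sample = np.sin(2 * np.pi * 64 * n / 1024)
spectrum_half = to_frequency_domain(sample)
print(spectrum_half.shape, int(spectrum_half.argmax()))
```

The 1024-point sample becomes a 512-point input, and the tone shows up as a single peak at bin 64.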
For the time-frequency domain input, Short-time Fourier Transform (STFT) is applied to each sample to obtain the time-frequency representation, as shown in Eq. 3. The Hanning window is used and the window length is set to 64. After this operation, the time-frequency representation (a 33x33 image) is generated as:

$$x^{\mathrm{STFT}} = \mathrm{STFT}(x) \tag{3}$$

where $x$ is a time domain sample and the operator $\mathrm{STFT}(\cdot)$ represents transforming $x$ into the time-frequency domain.
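One way to obtain a 33x33 time-frequency image from a 1024-point sample with a 64-point Hanning window is sketched below. The hop size of 32 (50% overlap) and the zero padding of half a window at both ends are assumptions chosen so that the output size matches the text; the source does not state them, and the function name `stft_image` is hypothetical.

```python
import numpy as np

def stft_image(sample, win_len=64, hop=32):
    """Magnitude STFT with a Hanning window; hop and padding are assumed."""
    window = np.hanning(win_len)
    padded = np.pad(sample, win_len // 2)            # zero-pad both ends
    n_frames = (len(padded) - win_len) // hop + 1
    frames = np.stack(
        [padded[i * hop : i * hop + win_len] * window for i in range(n_frames)],
        axis=1,
    )                                                # shape (win_len, n_frames)
    return np.abs(np.fft.rfft(frames, axis=0))       # (win_len // 2 + 1, n_frames)

# Synthetic tone with a period of 8 points, i.e. bin 8 of a 64-point frame.
sample = np.sin(2 * np.pi * np.arange(1024) / 8)
image = stft_image(sample)
print(image.shape)
```

With these settings, 1024 + 64 padded points yield 33 frames, and the real FFT of each 64-point frame yields 33 frequency bins.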
For the wavelet domain input, continuous wavelet transform (CWT) is applied to each sample to obtain the wavelet domain representation, as shown in Eq. 4. Because CWT is time-consuming, the length of each sample is set to 100. After this operation, the wavelet coefficients (a 100x100 image) are obtained as:

$$x^{\mathrm{CWT}} = \mathrm{CWT}(x) \tag{4}$$

where $x$ is a time domain sample and the operator $\mathrm{CWT}(\cdot)$ represents transforming $x$ into the wavelet domain.
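For the slicing image input, each sample is reshaped into a 32x32 image. After this operation, the new sample can be denoted as:

$$x^{\mathrm{Reshape}} = \mathrm{Reshape}(x) \tag{5}$$

where $x$ is a time domain sample and the operator $\mathrm{Reshape}(\cdot)$ represents reshaping $x$ into a 32x32 image.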
However, the above data preprocessing methods cause problems when training AE models and CNN models, in the following two respects: 1) if a large 2D signal is fed to AE models, the decoder has difficulty in the reconstruction procedure and the reconstruction error is very large; 2) if a small 2D signal is fed to CNN models, the CNN is unable to extract appropriate features.
Therefore, we have made a compromise on the data size obtained by the above data preprocessing methods. The sizes of the time domain and the frequency domain inputs are unchanged, as given by Eq. 1 and Eq. 2. For AE models, the sizes of all 2D inputs are adjusted to 32x32, while for CNN models, the sizes of signals after CWT, STFT, and slicing are adjusted to 300x300, 330x330, and 320x320, respectively. It should be noted that the input sizes of CNN models can differ because we use the AdaptiveMaxPooling layer to adapt to different input sizes.
Input normalization constrains the values of data to a certain range. It is a basic step in data preparation, which facilitates the subsequent data processing and accelerates the convergence of DL models. Therefore, we discuss the effects of three normalization methods on the performance of DL models.
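Maximum-Minimum Normalization: This normalization method can be implemented as

$$x^{\mathrm{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is the input sample, $x_{\min}$ is the minimum value in $x$, and $x_{\max}$ is the maximum value in $x$.
[-1, 1] Normalization: This normalization method can be implemented as

$$x^{\mathrm{norm}} = -1 + 2\,\frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Z-score Normalization: This normalization method can be implemented as

$$x^{\mathrm{norm}} = \frac{x - x_{\mathrm{mean}}}{x_{\mathrm{std}}}$$

where $x_{\mathrm{mean}}$ is the mean value of $x$, and $x_{\mathrm{std}}$ is the standard deviation of $x$.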
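The three normalization methods can be sketched per sample as follows. This is an illustrative version only; the function names are assumptions, and a constant sample (zero range or zero standard deviation) would need extra handling that is omitted here.

```python
import numpy as np

def max_min_normalize(x):
    """Map the sample linearly to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def minus_one_one_normalize(x):
    """Map the sample linearly to the range [-1, 1]."""
    return -1.0 + 2.0 * (x - x.min()) / (x.max() - x.min())

def z_score_normalize(x):
    """Shift to zero mean and scale to unit standard deviation."""
    return (x - x.mean()) / x.std()

x = np.array([1.0, 2.0, 3.0, 5.0])   # hypothetical sample values
a = max_min_normalize(x)
b = minus_one_one_normalize(x)
c = z_score_normalize(x)
print(a, b, c)
```

All three are affine transforms of the same sample, so they differ only in the resulting offset and scale.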
Data augmentation, a common step in CV and NLP, can make the training datasets more diverse and alleviate the learning difficulties caused by small sample sizes. However, data augmentation for intelligent fault diagnosis has not been investigated in depth. The key challenge for data augmentation is to create correctly labeled samples from existing samples, and this procedure mainly depends on domain knowledge. It is difficult to determine whether the generated samples still carry the correct labels. This paper therefore provides some data augmentation techniques as a starting point for other researchers. These data augmentation strategies are only a simple test, and their applications still need to be studied in depth.
RandomAddGaussian: this strategy randomly adds Gaussian noise to the input signal, formulated as follows:

$$\tilde{x} = x + n$$

where $x$ is the 1D input signal, and $n$ is noise drawn from a Gaussian distribution.
RandomScale: this strategy randomly multiplies the input signal by a random factor, formulated as follows:

$$\tilde{x} = \beta x$$

where $x$ is the 1D input signal, and $\beta$ is a random scalar factor.
RandomStretch: this strategy resamples the signal by a random proportion and keeps the length unchanged by zero-padding and truncating.
RandomCrop: this strategy randomly masks part of the signal, formulated as follows:

$$\tilde{x} = \mathrm{mask} \cdot x$$

where $x$ is the 1D input signal, and $\mathrm{mask}$ is a binary sequence in which a subsequence at a random position is zero. In this paper, the length of the subsequence is 10.
RandomScale: this strategy randomly multiplies the input signal by a random factor, formulated as follows:

$$\tilde{x} = \beta x$$

where $x$ is the 2D input signal, and $\beta$ is a random scalar factor.
RandomCrop: this strategy randomly masks part of the signal, formulated as follows:

$$\tilde{x} = \mathrm{mask} \cdot x$$

where $x$ is the 2D input signal, and $\mathrm{mask}$ is a binary sequence in which a subsequence at a random position is zero. In this paper, the length of the subsequence is 20.
Because 2D inputs in intelligent fault diagnosis often have clear physical meanings, data augmentation methods from image processing cannot be transferred directly to intelligent fault diagnosis.
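Three of the 1D strategies listed above can be sketched as follows. The noise level `sigma=0.01`, the Gaussian factor centred at 1 for scaling, and the function names are assumptions of this example, since the text above does not give these parameters. For simplicity the sketch always applies each transform, whereas the strategies above are applied randomly.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_add_gaussian(x, sigma=0.01):
    """Add zero-mean Gaussian noise (sigma is an assumed value)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def random_scale(x, sigma=0.01):
    """Multiply by one random factor near 1 (distribution is assumed)."""
    return x * rng.normal(1.0, sigma)

def random_crop(x, crop_len=10):
    """Zero out crop_len consecutive points at a random position."""
    mask = np.ones_like(x)
    start = rng.integers(0, len(x) - crop_len + 1)
    mask[start : start + crop_len] = 0
    return x * mask

x = np.ones(1024)                    # hypothetical 1D sample
noisy = random_add_gaussian(x)
scaled = random_scale(x)
cropped = random_crop(x)
print(int((cropped == 0).sum()))     # number of masked points
```

Each transform keeps the sample length at 1024, so augmented samples can be fed to the same model without resizing.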
One common practice of data split in intelligent fault diagnosis is the random split strategy, whose diagram is shown in Fig. 6. In this diagram, we stress that the preprocessing step is performed without overlap: if the sample preparation process introduces any overlap between samples, the evaluation of classification algorithms may suffer from test leakage. (It is also worth mentioning that if users split the training set and the testing set at the very beginning of the preprocessing step, they can then apply any processing to both the training and testing sets, as shown in Fig. 7.) In addition, many papers confuse the validation (val) set and the testing set. Formally, the training set should be further split into a training set and a validation set for model selection. Fig. 6 shows the case of 4-fold cross validation, and the average accuracy of 4-fold cross validation is often used to represent the generalization accuracy if there is no testing set. In this paper, for testing convenience and time saving, we only use 1-fold validation and use the last epoch accuracy to represent the testing accuracy (we also list the maximum accuracy over all epochs for comparison). It is worth noting that some papers report the maximum accuracy on the validation set; this strategy is also dangerous because the validation set is then implicitly used to select the parameters.
Industrial data from rotating machinery are rarely random and are usually sequential (they might contain trends or other temporal correlations). Therefore, it is more appropriate to split data according to time sequences (order split). The diagram of the data split strategy according to time sequences is shown in Fig. 8. In this diagram, we split the training and testing sets by time phase instead of splitting the data randomly. In addition, Fig. 8 also shows the case of 4-fold cross validation with time. In the following study, we compare the results of this strategy with the random split strategy using the last epoch accuracy and the maximum accuracy over all epochs.
It is a rather challenging task to evaluate the performance of intelligent fault diagnosis algorithms with suitable evaluation metrics. Three standard evaluation metrics have been widely used in intelligent fault diagnosis: the overall accuracy, the average accuracy, and the confusion matrix. In this paper, we only use the overall accuracy to evaluate the performance of algorithms. The overall accuracy is defined as the number of correctly classified samples divided by the total number of samples. The average accuracy is defined as the average of the classification accuracies of the individual categories. It should be noted that each class in our datasets has the same number of samples, so the overall accuracy is equal to the average accuracy.
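The two accuracy definitions above, and the fact that they coincide for balanced classes, can be checked with a small sketch. The label lists are made-up examples, and the function names are assumptions.

```python
from collections import defaultdict

def overall_accuracy(y_true, y_pred):
    """Correctly classified samples divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def average_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies."""
    hits, counts = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        counts[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)

# Balanced case: two samples per class (hypothetical labels).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
balanced_overall = overall_accuracy(y_true, y_pred)
balanced_average = average_accuracy(y_true, y_pred)

# Imbalanced case: predicting the majority class everywhere.
imb_overall = overall_accuracy([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
imb_average = average_accuracy([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
print(balanced_overall, balanced_average, imb_overall, imb_average)
```

In the balanced case both metrics agree, while in the imbalanced case the overall accuracy hides the fact that the minority class is never recognized.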
Since the performance of DL-based intelligent diagnosis algorithms fluctuates during the training process, to obtain reliable results and show the best overall accuracy that the model can achieve, we repeated each experiment five times. Four indicators are used to assess the performance of models, including the mean and maximum values of the overall accuracy obtained by the last epoch (the accuracy in the last epoch can represent the real accuracy without any test leakage), and the mean and maximum values of the maximal overall accuracy (in fact, when we use the maximal accuracy, we also use the testing set to choose the best model). For simplicity, they can be denoted as Last-Mean, Last-Max, Best-Mean, and Best-Max.
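The four indicators can be computed from the per-epoch test accuracies of the repeated runs, as in the sketch below. The accuracy histories are hypothetical numbers used only for illustration, and `summarize_runs` is an assumed name.

```python
def summarize_runs(histories):
    """histories: one list of per-epoch test accuracies for each repeated run."""
    last = [h[-1] for h in histories]     # accuracy at the last epoch of each run
    best = [max(h) for h in histories]    # maximal accuracy of each run
    return {
        "Last-Mean": sum(last) / len(last),
        "Last-Max": max(last),
        "Best-Mean": sum(best) / len(best),
        "Best-Max": max(best),
    }

# Two hypothetical runs with three epochs each.
histories = [[0.5, 0.9, 0.8], [0.6, 0.7, 0.7]]
indicators = summarize_runs(histories)
print(indicators)
```

By construction Best-Mean is never below Last-Mean, which reflects the optimistic bias of selecting the best epoch on the testing set.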
In the preparation stage, we use two strategies, random split and order split, to divide the dataset into training and testing sets. For random split, a sliding window is used to truncate the vibration signal without any overlap, and each data sample contains 1024 points. After this preparation, we randomly take 80% of the samples as the training set and 20% of the samples as the testing set. For order split, the first 80% of the time series is used to generate the training set, and the last 20% is used to generate the testing set. Then, within each of the two time series, a sliding window is used to truncate the vibration signal without any overlap, and each sample contains 1024 points.
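The two split strategies described above differ only in the order of cutting and shuffling, which the following sketch makes explicit. It is a simplified illustration with assumed function names and a synthetic signal, not the released library code.

```python
import numpy as np

def segment(signal, sample_len=1024):
    """Non-overlapping windows of sample_len points."""
    n = len(signal) // sample_len
    return signal[: n * sample_len].reshape(n, sample_len)

def random_split(signal, train_ratio=0.8, seed=0):
    """Segment first, then assign samples to train/test at random."""
    samples = segment(signal)
    idx = np.random.default_rng(seed).permutation(len(samples))
    n_train = int(train_ratio * len(samples))
    return samples[idx[:n_train]], samples[idx[n_train:]]

def order_split(signal, train_ratio=0.8):
    """Cut the time series first, then segment each part separately."""
    cut = int(train_ratio * len(signal))
    return segment(signal[:cut]), segment(signal[cut:])

signal = np.arange(10 * 1024, dtype=float)   # synthetic record, values = time index
rand_train, rand_test = random_split(signal)
ord_train, ord_test = order_split(signal)
print(rand_train.shape, rand_test.shape, ord_train.shape, ord_test.shape)
```

Under order split every training point precedes every testing point in time, whereas random split interleaves training and testing samples along the record.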
To verify how input types, data normalization methods, and data split methods affect the performance of models, we set up three configurations of experiments (shown in Table 8, Table 9, and Table 10) for each dataset. In model training, we use Adam as the optimizer and the softmax cross-entropy as the loss function. The learning rate and the batch size of each experiment are set to 0.001 and 64, respectively. Each model is trained for 100 epochs, and during the training procedure, model training and model testing are alternated. All experiments are executed under Windows 10 and PyTorch 1.1 on a computer with an Intel Core i7-9700K, a GeForce RTX 2080Ti, and 16 GB of RAM.
In this section, we discuss the experimental results in depth. Complete results are shown in Appendix A (accuracies larger than 95% are shown in bold).
From the results, it can be observed that all datasets except the XJTU-SY dataset have some accuracies exceeding 95%. In addition, the accuracies on the CWRU and SEU datasets can reach 100%. The accuracy on XJTU-SY is much lower than the others in all conditions, because XJTU-SY is a run-to-failure dataset and we only use the data at the end of the whole process (it may be hard to find the failure point easily and accurately). Besides, the diagnostic difficulty of the seven datasets can be ranked according to the sum of the best accuracy and the worst accuracy in one particular condition. The results used for sorting come from samples under the random split strategy processed by FFT, the Z-score normalization, and data augmentation. As shown in Fig. 9, we can split the datasets into four levels of difficulty.
In all datasets, the frequency domain input always achieves the highest accuracy, followed by the time-frequency domain input, because in the frequency domain the noise is spread over the full frequency band and the fault information is much easier to distinguish than in the time domain. It is also worth mentioning that, because of the computational load of CWT, we use short samples to perform CWT and then upsample the wavelet coefficients. These steps may degrade the classification accuracies of the CWT input.
From the results, it can be observed that models, especially ResNet18 belonging to CNN, can achieve the best accuracy in some datasets including CWRU, JNU, PU, and SEU. However, for MFPT, UoC, and XJTU-SY, models belonging to AE can perform better than other models. This phenomenon may be caused by the size of the datasets and the overfitting problem. Therefore, not every dataset can get better results using a more complex model.
It is hard to conclude which data normalization method is the best one, and from the results, it can be observed that accuracies of different data normalization methods also depend on the used models and datasets. In general, Z-score normalization can make the models achieve the best accuracy.
According to the results, we can conclude that when the accuracies of datasets are already high enough, data augmentation methods may slightly degrade the performance because models have already fitted original datasets well. More augmentation methods may change the distribution of original data and make the learning process harder. However, when the accuracies of datasets are not very high, data augmentation methods improve the performance of models, especially for the time domain input. It should be noted that data augmentation methods designed in this paper may be more suitable for the time domain input. Therefore, researchers can design other various data augmentation methods for their specific inputs.
When the datasets are easy to deal with (CWRU and SEU), the results of random split and order split are similar. However, the accuracies of some datasets (PU and UoC) decrease sharply under the order split. We should therefore pay more attention to whether randomly splitting these datasets carries a risk of test leakage. It may be more suitable to split the datasets according to time sequences when verifying the performance of designed models.
Although intelligent diagnosis algorithms can achieve high classification accuracies in many datasets, there are still many issues that need to be discussed. In this paper, we further discuss the following five issues including class imbalance, generalization ability, interpretability, few-shot learning, and model selection.
During operation of rotating machinery, most of the measured signals are in the normal state, and only a few of them are in the fault state. Fault modes often have different probabilities of happening. Meanwhile, working conditions also have different probabilities of happening. For example, the samples generated by helicopter hover, cruise, and other flight conditions are naturally unbalanced under the influence of the flight time, and thus the classification of helicopter flight conditions is a typical class imbalance issue. Therefore, the class imbalance issue will occur when using intelligent algorithms in real applications. Recently, although some researchers have published related papers using traditional imbalanced learning methods (63) or generative adversarial networks (35) to solve this problem, these studies are far from enough. In this paper, PU Bearing Datasets are used to simulate the class imbalance issue. In this experiment, we adopt ResNet18 as the experimental model and only use two input types (the time domain input and the frequency domain input). Besides, data augmentation methods are used and the normalization method is the Z-score normalization, while the dataset is randomly split. Three groups of datasets with different imbalance ratios are constructed, which are shown in Table 11.
|Fault mode||Training samples||Testing samples|
As shown in Table 11, three datasets (Group1, Group2, and Group3) are constituted with different imbalanced ratios. Group1 is a balanced dataset, and there is no imbalance for each state. In real applications, it is almost impossible to let the number of data samples be the same. We reduce the training samples of some fault modes in Group1 to construct Group2, and then the imbalanced classification is simulated. In Group3, the imbalance ratio between fault modes increases further. Group2 can be considered as a moderately imbalanced dataset, while Group3 can be considered as a highly imbalanced dataset.
Experimental results are shown in Fig. 10, and it can be observed that the overall accuracy in Group3 is much lower than that in Group1, which indicates that class imbalance greatly degrades the performance of models. To address the problem of class imbalance, data-level methods and classifier-level methods can be used (3). Oversampling and undersampling are the most commonly used data-level methods in DL, and some methods for generating samples based on generative adversarial networks (GAN) have also been studied recently. Among the classifier-level methods, thresholding-based methods are applied in the test phase to adjust the decision threshold of the classifier. Besides, cost-sensitive learning methods assign different weights to different classes to avoid the suppression of categories with a small number of samples. In the field of fault diagnosis, other methods based on physical meanings and fault attention need to be explored.
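As a rough illustration of one data-level and one classifier-level remedy mentioned above, the sketch below shows random oversampling and inverse-frequency class weights for cost-sensitive learning. These are generic techniques rather than methods evaluated in this paper; the weighting formula, function names, and toy labels are assumptions.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Class weight n / (k * count): rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for c, cnt in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == c]
        out_samples += [rng.choice(pool) for _ in range(target - cnt)]
        out_labels += [c] * (target - cnt)
    return out_samples, out_labels

# Toy imbalanced training set: 8 normal samples and 2 inner ring faults.
labels = ["normal"] * 8 + ["inner"] * 2
samples = list(range(10))
weights = inverse_frequency_weights(labels)
new_samples, new_labels = random_oversample(samples, labels)
print(weights, Counter(new_labels))
```

The weights could be passed to a weighted cross-entropy loss, while the oversampled set could be used with an unweighted loss; combining both would over-correct.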
Many existing intelligent algorithms perform very well under one working condition, but their diagnostic performance tends to drop significantly under another working condition; here, we call this the generalization problem. Recently, many researchers have used algorithms based on transfer learning strategies to solve this problem. To illustrate the weak generalization ability of intelligent diagnosis algorithms, experiments are also carried out on the PU bearing dataset. The experiments use data under three working conditions (N15_M07_F10, N09_M07_F10, N15_M01_F10). In these experiments, data under one working condition are used to train models, and data under another working condition are used to test the performance. A total of six groups of experiments are performed, and the detailed information is shown in Table 12.
|Group||Data for training||Data for testing|
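The count of six groups follows from choosing an ordered (training, testing) pair from the three working conditions, as the short sketch below shows. The enumeration order here is simply that of `itertools.permutations` and is not claimed to match the group numbering in Table 12.

```python
from itertools import permutations

conditions = ["N15_M07_F10", "N09_M07_F10", "N15_M01_F10"]

# Every ordered pair of distinct conditions: train on one, test on the other.
groups = list(permutations(conditions, 2))
for train_cond, test_cond in groups:
    print(f"train on {train_cond}, test on {test_cond}")
```

Three conditions give 3 x 2 = 6 train/test pairs, and the training and testing conditions always differ.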
The experimental results are shown in Fig. 11. It can be concluded that in most cases, intelligent diagnosis algorithms trained on one working condition cannot perform well on another working condition, which means the generalization ability of algorithms is insufficient. In general, we expect our algorithms can adapt to the changes in working conditions or measurement situations since these changes occur frequently in real applications. Therefore, studies still need to be done on how to transfer the trained algorithms to different working conditions effectively.
Although intelligent diagnosis algorithms can achieve high diagnostic accuracy in their tasks, the interpretability of these models is often insufficient, and these black box models can produce high-risk results (43), which greatly reduces the reliability of the results and limits their applications. Some papers in intelligent fault diagnosis have noted this problem and attempted to propose interpretable models (33; 32).
To point out that the intelligent diagnostic algorithm lacks interpretability, we perform three sets of experiments on the PU bearing dataset, and the datasets are shown in Table 13. In each set of experiments, we use two different sets of data, which have the same fault mode and are acquired under the same condition.
|Group||Bearing code||Training samples||Testing samples|
The results are shown in Fig. 12: intelligent algorithms achieve very high diagnosis accuracies in each set of experiments. Nevertheless, for each binary classification task, the fault mode and the working condition at the time of acquisition are the same for the two classes, so theoretically the methods should not be able to achieve such high accuracy. This expectation is exactly contrary to the experimental results, which suggests that the models only learn to discriminate between different collection points and do not learn to extract the essential characteristics of fault signals. Therefore, it is very important to figure out whether models learn essential fault characteristics or just classify the different conditions of the collected signals.
Following the development of interpretability in computer science, we may be able to study the interpretability of DL-based models from the following aspects: (1) visualize the results of neurons to analyze the attention points of models (60); (2) add physical constraints to the loss function (51) to meet specific needs of fault feature extraction; (3) add prior knowledge to network structures and convolutions (41) or unroll existing optimization algorithms (12) to extract corresponding fault features.
The rapid development of deep learning is associated with the big data era. However, in intelligent diagnosis, the amount of data is far from big data because fault data are precious and fault simulation experiments are costly, especially for key components. To show the influence of the number of samples on the classification accuracy, we use the PU bearing dataset to design a few-shot training pattern with six groups of different numbers of training samples per class.
Results of the time domain input and the frequency domain input are shown in Fig. 13. It is shown that with the decrease of the sample number, the accuracy decreases sharply. As shown in Fig. 13, for the time domain input, the Best-Max accuracy decreases from 91.46% to 20.39% as the sample number decreases from 100 to 1. Meanwhile, the Best-Max accuracy decreases from 97.73% to 29.67% as the sample number decreases from 100 to 1 with the frequency domain input.
Although the accuracy can be increased by using FFT, it is still too low to be acceptable when the number of samples is extremely small. It is necessary to develop methods based on few-shot learning to cope with application scenarios with limited samples.
Many DL-based few-shot learning models have been proposed in recent years. Most of these methods adopt a meta-learning paradigm that trains networks on a large number of tasks, which means that big data from other related fields are necessary for these methods. In the field of fault diagnosis, no relevant data of such a size are available, so methods that embed physical mechanisms are required to address this problem effectively.
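A few-shot training set of the kind used in this experiment can be built by drawing a fixed number of samples from each class, as in the sketch below. The function name, the toy data, and the value `k=5` are assumptions; the actual six sample numbers used in the paper are not restated here.

```python
import random
from collections import defaultdict

def few_shot_subset(samples, labels, k, seed=0):
    """Draw k training samples per class without replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, l in zip(samples, labels):
        by_class[l].append(s)
    subset = []
    for l, items in by_class.items():
        # rng.sample raises ValueError if a class has fewer than k samples.
        subset += [(s, l) for s in rng.sample(items, k)]
    return subset

# Toy data: 30 sample ids, class label = id modulo 3.
samples = list(range(30))
labels = [i % 3 for i in samples]
subset = few_shot_subset(samples, labels, k=5)
print(len(subset))
```

The testing set stays unchanged across the groups, so any drop in accuracy can be attributed to the smaller training set.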
For intelligent fault diagnosis, designing a neural network is not the final goal; our task is to apply the model to real industrial applications, and designing a neural network is only a small part of that task. However, to achieve good performance, we have to spend considerable time and energy on designing suitable networks, because building a neural network is an iterative process of repeated trial and error in which the performance of models is fed back to adjust them. The cost of a single trial multiplied by the number of trials can easily become huge. Reducing this cost is also part of the purpose of this benchmark study, which provides some guidelines for choosing a baseline model.
Actually, there is another way, called neural architecture search (NAS) (9), to avoid the huge cost of trial and error. NAS designs a neural network automatically by searching for a specific network based on a specific dataset. A limited search space of networks is first constructed according to physical priors. After that, a neural network matching a specific dataset is sampled from the search space through reinforcement learning, an evolutionary algorithm, or a gradient-based strategy. The whole construction process does not require manual participation, which greatly reduces the cost of building a neural network and allows us to focus on specific engineering applications.
In this paper, we collect most of the publicly available datasets to evaluate the performance of MLP, AE, CNN, and RNN models from several perspectives. Based on the benchmark accuracies, we highlight some evaluation results which are very important for comparing or testing new models. First, not all datasets are suitable for comparing the classification effectiveness of proposed methods, since basic models can already achieve very high accuracies on some datasets, like CWRU and SEU. Second, the frequency domain input achieves the highest accuracy in all datasets, so researchers should first try the frequency domain as the input. Third, CNN models do not necessarily give the best results in all cases, and the overfitting problem should also be considered. Fourth, when the accuracies of datasets are not very high, data augmentation methods improve the performance of models, especially for the time domain input. Thus, more effective data augmentation methods need to be investigated. Finally, in some cases, it may be more suitable to split the datasets according to time sequences (order split), since random split may yield artificially high accuracies. Taking these evaluation results into consideration may be helpful when developing new models.
In addition, we release a code library for other researchers to test the performance of their own DL-based intelligent fault diagnosis models on these datasets. Through these works, we hope that the evaluation results and the code library can promote a better understanding of DL-based models and provide a unified framework for generating more effective models. For further studies, we will focus on the five listed issues (class imbalance, generalization ability, interpretability, few-shot learning, and model selection) to propose more customized works.
Deep transfer network with joint distribution adaptation: a new intelligent fault diagnosis framework for industry application. ISA transactions.
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems, pp. 1137–1144.
Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103.
A hybrid prognostics approach for estimating remaining useful life of rolling element bearings. IEEE Transactions on Reliability.