Deep Learning Algorithms for Rotating Machinery Intelligent Diagnosis: An Open Source Benchmark Study

03/06/2020 ∙ by Zhibin Zhao, et al.

With the development of artificial intelligence and deep learning (DL) techniques, rotating machinery intelligent diagnosis has made tremendous progress with verified success, and the classification accuracies of many DL-based intelligent diagnosis algorithms are approaching 100%. However, different datasets, configurations, and hyper-parameters are often recommended for the performance verification of different types of models, and few open source codes are made public for evaluation and comparison. Therefore, unfair comparisons and ineffective improvements may exist in rotating machinery intelligent diagnosis, which limits the advancement of this field. To address these issues, we perform an extensive evaluation of four kinds of models on various datasets to provide a benchmark study within the same framework. In this paper, we first gather most of the publicly available datasets and give a complete benchmark study of DL-based intelligent algorithms under two data split strategies, five input formats, three normalization methods, and four augmentation methods. Second, we integrate the whole evaluation code into a code library and release it to the public for better development of this field. Third, we use specifically designed cases to point out existing issues, including class imbalance, generalization ability, interpretability, few-shot learning, and model selection. Through this work, we release a unified code framework for comparing and testing models fairly and quickly, emphasize the importance of open source codes, provide the baseline accuracy (a lower bound) to avoid useless improvement, and discuss potential future directions in this field. The code library is available at <https://github.com/ZhaoZhibin/DL-based-Intelligent-Diagnosis-Benchmark>.


1 Introduction

Prognostics and health management (PHM) is one of the most essential systems in modern industrial equipment, such as helicopters, aero-engines, wind turbines, and high-speed trains. The main function of PHM systems used in rotating machinery is intelligent fault diagnosis for condition-based maintenance. Intelligent fault diagnosis is the key component of PHM systems and has been studied widely. Traditional intelligent diagnosis methods mainly consist of feature extraction using various signal processing methods and fault classification using various machine learning techniques. Although advanced signal processing methods (fast Fourier transform (FFT), spectral kurtosis (SK), wavelet transform (WT), sparse representation, etc.) and machine learning algorithms (k-nearest neighbor (KNN), artificial neural network (ANN), support vector machine (SVM), etc.) have been successfully applied to intelligent diagnosis and have made considerable progress, performing diagnosis precisely and efficiently remains a challenging problem. With the development of online condition monitoring and data analysis systems, more and more kinds of real-time data are transferred from operating machines, and massive data are accumulated in the cloud. Faced with these heterogeneous massive data, feature extraction methods and signal-to-condition mappings that are designed and chosen by experts, and that depend to a great extent on prior knowledge, are time-consuming and empirical.

Deep learning (DL), as a booming data mining technique, has swept many fields including computer vision (CV) (23; 10), natural language processing (NLP) (18; 50; 59), etc. In 2006, the concept of DL was first introduced with the proposal of the deep belief network (DBN) (17). In 2013, MIT Technology Review ranked DL among the top ten breakthrough technologies (36). In 2015, a review (24) published in Nature stated that DL allows computational models composed of multiple processing layers to learn data representations with multiple levels of abstraction. Due to its strong representation learning ability, DL is well-suited to data analysis. Therefore, in the field of intelligent fault diagnosis, many researchers have applied DL-based techniques, such as multi-layer perceptron (MLP), auto-encoder (AE), convolutional neural network (CNN), deep belief network (DBN), and recurrent neural network (RNN), to various applications. A large number of DL-based intelligent diagnosis algorithms have been proposed in recent years, and their classification accuracies have been approaching 100%. However, when different researchers design DL-based intelligent diagnosis algorithms, they often recommend different inputs (such as time domain input, frequency domain input, time-frequency domain input, wavelet domain input, slicing image input, etc.) and set different hyper-parameters (such as the dimension of the input, the learning rate, the batch size, the network architecture, etc.). In addition, few authors make their codes available for evaluation and comparison, and it is difficult for others to reproduce the results completely and correctly. Therefore, unfair comparisons and ineffective improvements may exist in this field. Considering that this field lacks open source codes and benchmark studies, it is crucial to evaluate and compare different DL-based intelligent diagnosis algorithms to provide the benchmark, or the lower bound, of their accuracies and performance, thereby helping further studies in this field to develop more persuasive and appropriate algorithms.

For comprehensive performance comparisons and evaluation, it is important to gather different kinds of datasets. Actually, there exist several datasets for intelligent fault diagnosis. However, not every dataset provides a detailed description and is suited for the fault classification. For some datasets, the category discrimination is relatively large, and even one simple classifier can achieve acceptable results. Therefore, to thoroughly perform data mining and assess the difficulty of datasets, it is necessary to collect different datasets in a library and evaluate the performance of algorithms for different datasets on a unified platform.

In addition, one common issue in intelligent fault diagnosis is the way data are split: researchers often use the random split strategy. This strategy is dangerous because, if the samples produced during data preparation overlap, the evaluation of classification algorithms will suffer from test leakage (42). Industrial data are rarely random and are almost always sequential (they might contain trends in the time domain). Therefore, it is more appropriate to split data according to time sequences (we simply call it order split) (42). Order split is also closer to reality, because in industry we always use historical data to predict future conditions. Conversely, if we randomly split the data, the diagnosis algorithms may be able to memorize future patterns, which is another source of test leakage.
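The leakage argument can be illustrated with a small, self-contained sketch. It is not taken from the released code library: the function names, the signal length of 10,000 points, the 1024-point window, the stride of 256, and the 20% test ratio are all illustrative assumptions. Windows are represented only by their index ranges, and a test window counts as leaked when it shares raw points with any training window.

```python
import random

def sliding_windows(n_points, window, stride):
    """Return (start, end) index pairs of windows cut from one signal."""
    return [(s, s + window) for s in range(0, n_points - window + 1, stride)]

def order_split(windows, test_ratio):
    """Keep time order: the earliest windows train, the latest windows test."""
    n_test = int(round(len(windows) * test_ratio))
    cut = len(windows) - n_test
    return windows[:cut], windows[cut:]

def random_split(windows, test_ratio, seed=0):
    """Shuffle the windows before cutting, ignoring time order."""
    shuffled = windows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

def leaked_test_windows(train, test):
    """Count test windows that share at least one raw point with a training window."""
    return sum(
        1 for ts, te in test
        if any(ts < e and s < te for s, e in train)
    )

if __name__ == "__main__":
    # Hypothetical settings: overlapping windows (stride < window).
    windows = sliding_windows(n_points=10_000, window=1024, stride=256)
    for name, split in (("order", order_split), ("random", random_split)):
        train, test = split(windows, test_ratio=0.2)
        print(name, "split:", leaked_test_windows(train, test),
              "of", len(test), "test windows overlap the training set")
```

With overlapping windows, the order split can only leak at the single boundary between the training and test segments, whereas the random split can place overlapping neighbours on both sides of the split. Using non-overlapping windows (stride equal to the window length) removes the raw-point overlap altogether, although a random split would still mix past and future segments.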

In this paper, we first collect most of the publicly available datasets and discuss whether each of them is suitable for intelligent fault diagnosis. Second, we release a code library that covers the data preparation for all datasets suitable for fault classification and the whole evaluation framework with different input formats, normalization methods, data split strategies, augmentation methods, and DL-based models. Meanwhile, we also use some datasets to discuss existing issues in intelligent fault diagnosis, including class imbalance, generalization ability, interpretability, few-shot learning, and model selection. To the best of our knowledge, this is the first work to comprehensively perform a benchmark study and release a code library of DL-based intelligent algorithms. In summary, this work mainly focuses on evaluating various DL-based intelligent diagnosis algorithms on most of the publicly available datasets from several perspectives, providing the benchmark accuracy (it is worth mentioning that the results are just a lower bound of accuracy) to avoid useless improvement, and releasing the code library for the complete evaluation procedures. Through this work, we hope to make comparing and testing models fairer and quicker, emphasize the importance of open source codes and benchmark studies in this field, and provide some suggestions and discussions for future studies.

The contributions of this paper are listed as follows:

  1. Various datasets and data preparation. We gather most of the publicly available datasets and give a detailed discussion of their suitability for DL-based intelligent diagnosis. For data preparation, we first discuss different kinds of input formats and different normalization methods for the listed datasets. After that, we state that data augmentation, which is a common step in CV and NLP, might be important for making the training datasets more diverse, and we try several data augmentation methods to show that they have not been fully investigated. Meanwhile, we also discuss how to split data and state that it may be more appropriate to split data according to time sequences (also called order split).

  2. Benchmark accuracy and further studies. We evaluate various DL-based intelligent diagnosis algorithms, including MLP, AE, CNN, and RNN, on different datasets and provide the benchmark accuracy to make future studies in this field more comparable and meaningful. We also use experimental examples to discuss existing problems in intelligent fault diagnosis, including class imbalance, generalization ability, interpretability, few-shot learning, and model selection.

  3. Open source codes. To improve the reproducibility of DL-based intelligent diagnosis algorithms, we release the whole evaluation code in a code library for the better development of this field. At the same time, this is a unified intelligent fault diagnosis library, which retains an extended interface so that users can load their own datasets and models to carry out new studies. The code library is available at https://github.com/ZhaoZhibin/DL-based-Intelligent-Diagnosis-Benchmark.

The rest of the paper is organized as follows. Section 2 gives a brief review of recent developments in DL-based intelligent diagnosis algorithms. Sections 3 to 9 then discuss the evaluation algorithms, datasets, data preprocessing, data augmentation, data split, evaluation methodologies, and evaluation results, respectively. After that, Section 10 presents further discussions of the results, followed by conclusions in Section 11.

2 Brief Review

Recently, DL has become a promising method in a wide range of fields, and a huge number of papers related to DL have been published since 2012. This paper mainly focuses on a benchmark study of intelligent fault diagnosis, rather than providing a comprehensive review of DL in other fields. Some well-known DL researchers have published more specialized references, and interested readers can refer to (24; 11).

In the field of intelligent fault diagnosis, due to the efforts of many researchers in recent years, DL has become one of the most popular data-driven methods for fault diagnosis and health monitoring. In general, DL-based methods can extract representative features adaptively without any manual intervention and can achieve higher accuracy than traditional machine learning algorithms in most tasks when the dataset is large enough. We conducted a literature search using the Web of Science Core Collection database. As shown in Fig. 1, the number of published papers related to DL-based intelligent algorithms increases year by year.


Figure 1: The relationship between the number of published papers and publication years covering the last five years (as of November 2019). The basic descriptor is “TI= ((deep OR autoencoder OR convolutional network* OR neural network*) AND (fault OR condition monitoring OR health management OR intelligent diagnosis))”.

Another interesting observation is that many review papers on this topic have been published in the recent four years. Therefore, in this paper, we only briefly review and introduce the main contents of different review papers to allow readers who just enter this field to find suitable review papers quickly.

In bearing fault diagnosis, Li et al. (29) provided a systematic review of fuzzy formalisms including combination with other machine learning algorithms. Hoang et al. (19) provided a comprehensive review of three popular DL algorithms (AE, DBN, and CNN) for bearing fault diagnosis. Zhang et al. (61) systematically reviewed the machine learning and DL-based algorithms for bearing fault diagnosis and also provided a comparison of the classification accuracy of CWRU with different DL-based methods. Hamadache et al. (13) reviewed different fault modes of rolling element bearings and described various health indexes for PHM. Meanwhile, it also provided a survey of artificial intelligence methods for PHM including shallow learning and deep learning.

In rotating machinery intelligent diagnosis, Ali et al. (2) provided a review of AI-based methods using acoustic emission data for rotating machinery condition monitoring. Liu et al. (34) reviewed AI-based approaches including KNN, SVM, ANN, Naive Bayes, and DL for fault diagnosis of rotating machinery. Wei et al. (56) summarized early fault diagnosis of gears, bearings, and rotors through signal processing methods (adaptive decomposition methods, WT, and sparse decomposition) and AI-based methods (KNN, neural network, and SVM).

In machinery condition monitoring, Zhao et al. (64) and Duan et al. (7) reviewed diagnosis and prognosis of mechanical equipment based on DL algorithms such as DBN and CNN. Zhang et al. (62) reviewed computational intelligent approaches including ANN, evolutionary algorithms, fuzzy logic, and SVM for machinery fault diagnosis. Zhao et al. (65) reviewed data-driven machine health monitoring through DL methods (AE, DBN, CNN, and RNN) and provided the data and codes (in Keras) about an experimental study.

In addition, Nasiri et al. (37) surveyed the state-of-the-art AI-based approaches for fracture mechanics and provided the accuracy comparisons achieved by different machine learning algorithms for mechanical fault detection. Tian et al. (52) surveyed different modes of traction induction motor fault and their diagnosis algorithms including model-based methods and AI-based methods. Khan et al. (21) provided a comprehensive review of AI for system health management and emphasized the trend of DL-based methods with limitations and benefits. Stetco et al. (49) reviewed machine learning approaches applied to wind turbine condition monitoring and made a discussion of the possibility for the future research. Ellefsen et al. (8) reviewed four well-established DL algorithms including AE, CNN, DBN, and LSTM for PHM applications and discussed the chances and challenges for the future studies, especially in the field of PHM in autonomous ships. AI-based algorithms (traditional machine learning algorithms and DL-based approaches) and applications (smart sensors, intelligent manufacturing, PHM, and cyber-physical systems) were reviewed in (1; 6; 55; 47) for smart manufacturing and manufacturing diagnosis.

Although a large body of DL-based methods and many related reviews have been published in the field of intelligent fault diagnosis, few studies thoroughly evaluate various DL-based intelligent diagnosis algorithms for most of the publicly available datasets, provide the benchmark accuracy, and release the code library for complete evaluation procedures. For example, a simple code written in Keras was published in (65), which is not comprehensive enough for different datasets and models. The accuracy comparisons were provided in (61; 37) according to existing papers, but they were not comprehensive enough due to different configurations and test conditions. Therefore, this paper is intended to make up for this gap and emphasize the importance of open source codes and the benchmark study in this field.

3 Evaluation Algorithm

A large number of DL-based intelligent diagnosis methods have been published in the field of fault diagnosis and prognosis. It is impossible to cover all the published models since there is currently no open source community in this field. Therefore, we instead test the performance of four categories of representative models (MLP, AE, CNN, and RNN) that embed some advanced techniques. It should be noted that DBN is another commonly used DL method for fault diagnosis, but we do not add it to this code library because the training procedure of DBN differs greatly from that of these four categories.

3.1 MLP

Multilayer perceptron (MLP) (44), a fully connected network with one or more hidden layers, was proposed in 1987 as the prototype of an artificial neural network (ANN). With such a simple structure, MLP can complete some easy classification tasks such as MNIST. However, as the task becomes more complex, MLP becomes hard to train because of its huge number of parameters. An MLP with five fully connected layers and five batch normalization layers is used in this paper for one-dimensional (1D) input data. The structure and parameters of the model are shown in Fig. 2, where FC means the fully connected layer, BN means the batch normalization layer, and CE loss means the softmax cross-entropy loss.


Figure 2: The structure of the multilayer perceptron.
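The softmax cross-entropy (CE) loss named in Fig. 2 can be written out for a single sample as a minimal sketch. The function names and the example logits are illustrative assumptions and are not taken from the released code library; the computation uses the log-sum-exp form for numerical stability.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to one."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample: -log(softmax(logits)[target])."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

if __name__ == "__main__":
    # Hypothetical scores for a four-class problem; class 2 is the true label.
    logits = [0.5, -1.0, 3.0, 0.0]
    print("probabilities:", softmax(logits))
    print("loss:", cross_entropy(logits, target=2))
```

The loss shrinks as the score of the true class grows relative to the others, and equals the logarithm of the number of classes when all scores are equal.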

3.2 AE

Auto-encoder (AE) was first proposed in 2006 as a method for dimensionality reduction. It can reduce the dimensionality of the input data while retaining most of the information in the data. AE consists of an encoder and a decoder; the decoder tries to reconstruct the input from the output of the encoder, and the reconstruction error is used as the loss function. The encoder and decoder are trained to generate a low-dimensional representation of the input and to reconstruct the input from that representation, respectively. Subsequently, various derivatives of AE were proposed by researchers, such as variational auto-encoder (VAE) (22), denoising auto-encoder (DAE) (53), and sparse auto-encoder (SAE) (40). In this paper, we design the deep AE and its derivatives for 1D input data and two-dimensional (2D) input data, respectively. Because the networks have different characteristics, their structures and hyper-parameters, shown in Fig. 3, are adapted accordingly. Specifically, the network structures of DAE and SAE are the same as that of AE, and the differences lie in the loss function and the inputs. During the training of AE and its derivatives, the encoder and decoder are first trained jointly to obtain low-dimensional features of the data. After that, the encoder and classifier are trained jointly for the classification task. In Fig. 3, the MSE loss means the mean square error loss, Conv means the convolutional layer, and the KLP loss means the Kullback-Leibler divergence loss; Fig. 3 also marks the transposed convolutional (i.e., inverse convolution) layers.


Figure 3: The structure of deep auto-encoder and its derivatives
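The loss terms named above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: the section does not give the exact form of the KLP loss, so the sketch assumes the commonly used sparse auto-encoder penalty (the KL divergence between a target activation rate `rho` and the mean activation of each hidden unit), and it assumes additive Gaussian noise as the corruption used by a denoising auto-encoder. All function names and default values are hypothetical.

```python
import math
import random

def mse_loss(x, x_hat):
    """Mean square error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def kl_sparsity_penalty(mean_activations, rho=0.05):
    """Assumed SAE penalty: sum over hidden units of KL(rho || rho_hat)."""
    total = 0.0
    for rho_hat in mean_activations:
        rho_hat = min(max(rho_hat, 1e-8), 1.0 - 1e-8)   # keep the logs finite
        total += (rho * math.log(rho / rho_hat)
                  + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))
    return total

def corrupt(x, sigma=0.1, seed=0):
    """Assumed DAE corruption: add zero-mean Gaussian noise to the input."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in x]

if __name__ == "__main__":
    x = [0.2, -0.4, 0.9, 0.1]
    x_hat = [0.1, -0.3, 0.8, 0.0]            # hypothetical reconstruction
    print("AE loss :", mse_loss(x, x_hat))
    print("SAE loss:", mse_loss(x, x_hat) + kl_sparsity_penalty([0.2, 0.04]))
    print("DAE input:", corrupt(x))          # the DAE target stays the clean x
```

In this reading, AE, DAE, and SAE can share one network: DAE changes only what is fed to the encoder, and SAE changes only what is added to the reconstruction loss.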

3.3 CNN

Convolutional neural network (CNN) (25) was first proposed in 1997, and the proposed network was also called LeNet. CNN is a specialized kind of neural network for processing data that have a known grid-like topology. Sparse interactions, parameter sharing, and equivariant representations are realized through the convolution and pooling operations in CNN. In 2012, AlexNet (23) won the ImageNet competition by far surpassing the second place, and deep CNN has attracted wide attention since then. Besides, in 2016, ResNet (16) was proposed, and its classification accuracy exceeded the human baseline. In this paper, we design a five-layer 1D CNN and a five-layer 2D CNN for 1D input data and 2D input data, respectively, and also adapt three well-known CNN models (LeNet, ResNet18, and AlexNet) to the two types of input data. The details are shown in Fig. 4, where MaxPool means the max pooling layer, AdaptiveMaxPool means the adaptive max pooling layer, and Dropout means the dropout layer.


Figure 4: The structure of deep CNN and its derivatives
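The convolution and pooling operations mentioned above can be illustrated on a 1D signal with a minimal sketch. It is not the architecture of Fig. 4: the kernel values, the signal, and the function names are illustrative assumptions. As in most DL frameworks, the "convolution" below is computed as a sliding dot product without flipping the kernel.

```python
def conv_output_length(n, kernel, stride=1):
    """Length of a 1D convolution output without padding."""
    return (n - kernel) // stride + 1

def conv1d(signal, kernel, stride=1):
    """Slide one shared kernel over the signal (parameter sharing)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))   # local, sparse interaction
        for i in range(0, len(signal) - k + 1, stride)
    ]

def max_pool1d(x, size):
    """Keep the maximum of each non-overlapping block of `size` values."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5]
    kernel = [1, 0, -1]                  # hypothetical difference-like filter
    features = conv1d(signal, kernel)
    print(features)                      # [-2, -2, -2]
    print(max_pool1d([1, 3, 2, 5, 4], size=2))   # [3, 5]
```

Each output value depends only on a few neighbouring inputs, and the same three kernel weights are reused at every position, which is why a convolutional layer needs far fewer parameters than a fully connected layer on the same input.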

3.4 Recurrent Neural Network

Recurrent neural network (RNN) can describe temporal dynamic behavior and is very suitable for dealing with time series. However, RNN often suffers from vanishing and exploding gradients during training. To overcome these problems, the long short-term memory network (LSTM) was proposed in 1997 (20) for processing continual input streams and has achieved great success in various fields such as NLP. Bi-directional LSTM (BiLSTM) can capture bidirectional dependencies over long distances and learn to remember and forget information selectively. We utilize BiLSTM as the representative of RNN to deal with two types of input data (1D and 2D) for the classification task. The details of BiLSTM are shown in Fig. 5, where Transpose means transposing the channel and feature dimensions of the input data, and BiLSTM Block means the BiLSTM layer.


Figure 5: The structure of BiLSTM and its derivatives

4 Datasets

In the field of intelligent fault diagnosis, publicly available datasets have not been investigated in depth. For comprehensive performance comparisons and evaluation, it is important to gather different kinds of representative datasets. We collected nine commonly used datasets, all of which have specific labels and explanations except for the PHM 2012 bearing dataset and the IMS bearing dataset; PHM 2012 and IMS are therefore not suitable for fault classification, which requires labels. To sum up, this paper uses seven datasets to verify the performance of the models introduced in Section 3. All nine datasets are described below.

4.1 CWRU Bearing Dataset

CWRU datasets were provided by the Case Western Reserve University Bearing Data Center (5). Vibration signals were collected at 12 kHz or 48 kHz for normal bearings and damaged bearings with single-point defects under four different motor loads. Within each working condition, single-point faults were introduced with fault diameters of 0.007, 0.014, and 0.021 inches on the rolling element, the inner ring, and the outer ring, respectively. In this paper, we use the data collected from the drive end at a sampling frequency of 12 kHz. As listed in Table 1, one health state and three fault locations (the inner ring fault, the rolling element fault, and the outer ring fault) are classified into ten categories (one health state and nine fault states) according to different fault sizes.

Fault Mode | Description
Health State | Normal bearing at 1797 rpm and 0 HP
Inner ring 1 | 0.007 inch inner ring fault at 1797 rpm and 0 HP
Inner ring 2 | 0.014 inch inner ring fault at 1797 rpm and 0 HP
Inner ring 3 | 0.021 inch inner ring fault at 1797 rpm and 0 HP
Rolling Element 1 | 0.007 inch rolling element fault at 1797 rpm and 0 HP
Rolling Element 2 | 0.014 inch rolling element fault at 1797 rpm and 0 HP
Rolling Element 3 | 0.021 inch rolling element fault at 1797 rpm and 0 HP
Outer ring 1 | 0.007 inch outer ring fault at 1797 rpm and 0 HP
Outer ring 2 | 0.014 inch outer ring fault at 1797 rpm and 0 HP
Outer ring 3 | 0.021 inch outer ring fault at 1797 rpm and 0 HP
Table 1: Detailed description of CWRU datasets

4.2 MFPT Bearing Dataset

MFPT datasets were provided by the Society for Machinery Failure Prevention Technology (48). MFPT datasets consist of: 1) a baseline dataset sampled at 97656 Hz for six seconds in each file; 2) seven outer ring fault datasets sampled at 48828 Hz for three seconds in each file; 3) seven inner ring fault datasets sampled at 48828 Hz for three seconds in each file; 4) some other datasets which are not used in this paper (more detailed information can be found on the website of MFPT datasets (48)). As listed in Table 2, one health state and two fault types (the outer ring fault and the inner ring fault) are classified into fifteen categories (one health state and fourteen fault states) according to different loads.

Fault Mode | Description
Health State | Fault-free bearing working at 270 lbs
Outer ring 1 | Outer ring fault bearing working at 25 lbs
Outer ring 2 | Outer ring fault bearing working at 50 lbs
Outer ring 3 | Outer ring fault bearing working at 100 lbs
Outer ring 4 | Outer ring fault bearing working at 150 lbs
Outer ring 5 | Outer ring fault bearing working at 200 lbs
Outer ring 6 | Outer ring fault bearing working at 250 lbs
Outer ring 7 | Outer ring fault bearing working at 300 lbs
Inner ring 1 | Inner ring fault bearing working at 0 lbs
Inner ring 2 | Inner ring fault bearing working at 50 lbs
Inner ring 3 | Inner ring fault bearing working at 100 lbs
Inner ring 4 | Inner ring fault bearing working at 150 lbs
Inner ring 5 | Inner ring fault bearing working at 200 lbs
Inner ring 6 | Inner ring fault bearing working at 250 lbs
Inner ring 7 | Inner ring fault bearing working at 300 lbs
Table 2: Detailed description of MFPT datasets
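The sampling rates and durations quoted for the MFPT files fix how many raw points each file contains, and therefore how many fixed-length samples it can supply. The arithmetic is sketched below. The sample length of 1024 points and the choice of non-overlapping segments are assumptions made only for illustration, since this section does not state them.

```python
def n_points(fs_hz, seconds):
    """Number of raw points in a file recorded at fs_hz for the given duration."""
    return fs_hz * seconds

def n_segments(total_points, segment_length):
    """Number of non-overlapping segments that fit into one file."""
    return total_points // segment_length

if __name__ == "__main__":
    SEGMENT_LENGTH = 1024                      # hypothetical sample length
    baseline = n_points(97656, 6)              # baseline files: 97656 Hz, 6 s
    fault = n_points(48828, 3)                 # fault files: 48828 Hz, 3 s
    print(baseline, n_segments(baseline, SEGMENT_LENGTH))   # 585936 572
    print(fault, n_segments(fault, SEGMENT_LENGTH))         # 146484 143
```

Under these assumptions a baseline file yields about four times as many samples as a fault file, which is one way class imbalance can arise from the raw files alone.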

4.3 PU Bearing Dataset

PU datasets were provided by the Paderborn University Bearing Data Center (28; 27), and PU datasets consisted of 32 sets of bearing current signals and vibration signals. As shown in Table 3, bearings are divided into: 1) six undamaged bearings; 2) twelve artificially damaged bearings; 3) fourteen bearings with real damages caused by accelerated lifetime tests. Each dataset was collected under four working conditions as shown in Table 4.

Bearing Code | Fault Mode | Description
K001 | Health state | Run-in 50 h before test
K002 | Health state | Run-in 19 h before test
K003 | Health state | Run-in 1 h before test
K004 | Health state | Run-in 5 h before test
K005 | Health state | Run-in 10 h before test
K006 | Health state | Run-in 16 h before test
KA01 | Artificial outer ring fault (Level 1) | Made by EDM
KA03 | Artificial outer ring fault (Level 2) | Made by electric engraver
KA05 | Artificial outer ring fault (Level 1) | Made by electric engraver
KA06 | Artificial outer ring fault (Level 2) | Made by electric engraver
KA07 | Artificial outer ring fault (Level 1) | Made by drilling
KA08 | Artificial outer ring fault (Level 2) | Made by drilling
KA09 | Artificial outer ring fault (Level 2) | Made by drilling
KI01 | Artificial inner ring fault (Level 1) | Made by EDM
KI03 | Artificial inner ring fault (Level 1) | Made by electric engraver
KI05 | Artificial inner ring fault (Level 1) | Made by electric engraver
KI07 | Artificial inner ring fault (Level 2) | Made by electric engraver
KI08 | Artificial inner ring fault (Level 2) | Made by electric engraver
KA04 | Outer ring damage (single point + S + Level 1) | Caused by fatigue and pitting
KA15 | Outer ring damage (single point + S + Level 1) | Caused by plastic deformation and indentation
KA16 | Outer ring damage (single point + R + Level 2) | Caused by fatigue and pitting
KA22 | Outer ring damage (single point + S + Level 1) | Caused by fatigue and pitting
KA30 | Outer ring damage (distributed + R + Level 1) | Caused by plastic deformation and indentation
KB23 | Outer ring and inner ring damage (single point + M + Level 2) | Caused by fatigue and pitting
KB24 | Outer ring and inner ring damage (distributed + M + Level 3) | Caused by fatigue and pitting
KB27 | Outer ring and inner ring damage (distributed + M + Level 1) | Caused by plastic deformation and indentation
KI04 | Inner ring damage (single point + M + Level 1) | Caused by fatigue and pitting
KI14 | Inner ring damage (single point + M + Level 1) | Caused by fatigue and pitting
KI16 | Inner ring damage (single point + S + Level 3) | Caused by fatigue and pitting
KI17 | Inner ring damage (single point + R + Level 1) | Caused by fatigue and pitting
KI18 | Inner ring damage (single point + S + Level 2) | Caused by fatigue and pitting
KI21 | Inner ring damage (single point + S + Level 1) | Caused by fatigue and pitting
Table 3: Detailed description of PU datasets (S: single damage; R: repetitive damage; M: multiple damage)

No. | Rotating speed (rpm) | Load torque (Nm) | Radial force (N) | Name of setting
0 | 1500 | 0.7 | 1000 | N15_M07_F10
1 | 900 | 0.7 | 1000 | N09_M07_F10
2 | 1500 | 0.1 | 1000 | N15_M01_F10
3 | 1500 | 0.7 | 400 | N15_M07_F04
Table 4: Four working conditions of PU datasets

In this paper, since using all the data would require a huge amount of computation time, we only use the data collected from bearings with real damage (KA04, KA15, KA16, KA22, KA30, KB23, KB24, KB27, KI14, KI16, KI17, KI18, and KI21) under the working condition N15_M07_F10 to carry out the performance verification. It is worth mentioning that, since KI04 has exactly the same description as KI14 in Table 3, we delete KI04, and the total number of classes is thirteen. Besides, only vibration signals are used for testing the models.

4.4 UoC Gear Fault Dataset

UoC gear fault datasets were provided by the University of Connecticut (4), and UoC datasets were collected at 20 kHz. In this dataset, nine different gear conditions were introduced to the pinions on the input shaft, including healthy condition, missing tooth, root crack, spalling, and chipping tip with 5 different levels of severity. All the collected datasets are used and classified into nine categories (one health state and eight fault states) to test the performance.

4.5 XJTU-SY Bearing Dataset

XJTU-SY bearing datasets were provided by the Institute of Design Science and Basic Component at Xi’an Jiaotong University and the Changxing Sumyoung Technology Co. (57; 54). XJTU-SY datasets consist of run-to-failure data from fifteen bearings under three different working conditions. Data were collected at 25.6 kHz. A total of 32768 data points (i.e., 1.28 s) were recorded for each sampling, and the sampling period is one minute. The details of bearing lifetime and fault elements are shown in Table 5. In this paper, we use all the data described in Table 5, and the total number of classes is fifteen. Note that we use the data collected at the end of the run-to-failure experiments.

Condition | File | Lifetime | Fault element
Speed: 35 Hz, Load: 12 kN | Bearing 1_1 | 2 h 3 min | Outer ring
Speed: 35 Hz, Load: 12 kN | Bearing 1_2 | 2 h 41 min | Outer ring
Speed: 35 Hz, Load: 12 kN | Bearing 1_3 | 2 h 38 min | Outer ring
Speed: 35 Hz, Load: 12 kN | Bearing 1_4 | 2 h 2 min | Cage
Speed: 35 Hz, Load: 12 kN | Bearing 1_5 | 52 min | Inner ring and Outer ring
Speed: 37.5 Hz, Load: 11 kN | Bearing 2_1 | 8 h 11 min | Inner ring
Speed: 37.5 Hz, Load: 11 kN | Bearing 2_2 | 2 h 41 min | Outer ring
Speed: 37.5 Hz, Load: 11 kN | Bearing 2_3 | 8 h 53 min | Cage
Speed: 37.5 Hz, Load: 11 kN | Bearing 2_4 | 42 min | Outer ring
Speed: 37.5 Hz, Load: 11 kN | Bearing 2_5 | 5 h 39 min | Outer ring
Speed: 40 Hz, Load: 10 kN | Bearing 3_1 | 42 h 18 min | Outer ring
Speed: 40 Hz, Load: 10 kN | Bearing 3_2 | 41 h 36 min | Inner ring, Rolling element, Cage, and Outer ring
Speed: 40 Hz, Load: 10 kN | Bearing 3_3 | 6 h 11 min | Inner ring
Speed: 40 Hz, Load: 10 kN | Bearing 3_4 | 25 h 15 min | Inner ring
Speed: 40 Hz, Load: 10 kN | Bearing 3_5 | 1 h 54 min | Outer ring
Table 5: Detailed description of XJTU-SY datasets

4.6 SEU Gearbox Dataset

SEU gearbox datasets were provided by Southeast University (45; 46). SEU datasets contain two sub-datasets, a bearing dataset and a gear dataset, which were both acquired on the Drivetrain Dynamics Simulator (DDS). There are two working conditions, with the rotating speed - load configuration (RS-LC) set to 20 Hz - 0 V and 30 Hz - 2 V, as shown in Table 6. Counting each fault mode under each working condition as a separate class, the total number of classes in Table 6 is twenty. Within each file, there are eight rows of vibration signals, and we use the second row.

Fault Mode | RS-LC | Fault Mode | RS-LC
Health Gear | 20 Hz - 0 V | Health Bearing | 20 Hz - 0 V
Health Gear | 30 Hz - 2 V | Health Bearing | 30 Hz - 2 V
Chipped Tooth | 20 Hz - 0 V | Inner ring | 20 Hz - 0 V
Chipped Tooth | 30 Hz - 2 V | Inner ring | 30 Hz - 2 V
Missing Tooth | 20 Hz - 0 V | Outer ring | 20 Hz - 0 V
Missing Tooth | 30 Hz - 2 V | Outer ring | 30 Hz - 2 V
Root Fault | 20 Hz - 0 V | Inner + Outer ring | 20 Hz - 0 V
Root Fault | 30 Hz - 2 V | Inner + Outer ring | 30 Hz - 2 V
Surface Fault | 20 Hz - 0 V | Rolling Element | 20 Hz - 0 V
Surface Fault | 30 Hz - 2 V | Rolling Element | 30 Hz - 2 V
Table 6: Detailed description of SEU datasets

4.7 JNU Bearing Dataset

JNU bearing datasets were provided by Jiangnan University (31; 30). JNU datasets consisted of three bearing vibration datasets with different rotating speeds, and the data were collected at 50 kHz. As shown in Table 7, JNU datasets contained one health state and three fault modes which include inner ring fault, outer ring fault, and rolling element fault. Therefore, the total number of classes is equal to twelve according to different working conditions.

Fault Mode | Rotating Speed | Fault Mode | Rotating Speed | Fault Mode | Rotating Speed
Health State | 600 rpm | Health State | 800 rpm | Health State | 1000 rpm
Inner ring | 600 rpm | Inner ring | 800 rpm | Inner ring | 1000 rpm
Outer ring | 600 rpm | Outer ring | 800 rpm | Outer ring | 1000 rpm
Rolling Element | 600 rpm | Rolling Element | 800 rpm | Rolling Element | 1000 rpm
Table 7: Detailed description of JNU datasets

4.8 PHM 2012 Bearing Dataset

PHM 2012 bearing datasets were used for the PHM IEEE 2012 Data Challenge (39; 38). In PHM 2012 datasets, seventeen run-to-failure datasets were provided, including six training sets and eleven testing sets. Three different loads were considered. Vibration and temperature signals were gathered during all those experiments. Since no label on the types of failures was given, these datasets are not used in this paper.

4.9 IMS Bearing Dataset

IMS bearing datasets were generated by the NSF I/UCR Center for Intelligent Maintenance Systems (26). IMS datasets were made up of three bearing datasets, and each of them contained vibration signals of four bearings installed on the different locations. At the end of the run-to-failure experiment, a defect occurred on one of the bearings. The failure occurred in the different locations of bearings. It is inappropriate to classify these failures simply using three classes, so IMS datasets are not evaluated in this paper.

5 Data Preprocessing

The reason why DL is superior in fault classification lies in its excellent feature extraction ability and feature space transformation ability. Although it is an end-to-end learning method, the type of input data and the way of normalization have a great impact on its performance. The type of input data determines the difficulty of feature extraction, and the normalization method determines the difficulty of calculation. So, in this paper, effects of five different input types and three different normalization methods on the performance of DL models are discussed.

5.1 Input Types

In the field of CV and NLP, commonly used input types consist of images and texts, while in intelligent fault diagnosis, what we collected directly is the time series. Therefore, many researchers use signal processing methods to map the time series to different domains to get a better input type. However, which input type is more suitable to the intelligent fault diagnosis is still an open question. In this paper, effects of different input types on model performance are discussed.

5.1.1 Time Domain Input

For the time domain input, vibration signals are directly used as the input without data preprocessing. In this paper, the length of each sample is equivalent to 1024 and the total number of samples can be obtained from Eq. 1. After generating samples, we take 80% of total samples as the training set and 20% of total samples as the testing set.

$$N = \mathrm{floor}\left(\frac{L}{1024}\right) \tag{1}$$

where $L$ is the length of each signal, $N$ is the total number of samples, and floor means rounding towards minus infinity.
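As a minimal sketch of Eq. 1, the sample generation step could be written as follows. The function name `segment_signal` and the synthetic record are illustrative assumptions and are not taken from the released code library.

```python
import numpy as np

def segment_signal(signal, sample_len=1024):
    """Cut a 1D signal into non-overlapping samples of length sample_len."""
    signal = np.asarray(signal, dtype=float)
    n_samples = len(signal) // sample_len        # N = floor(L / sample_len)
    usable = signal[: n_samples * sample_len]    # drop the incomplete tail
    return usable.reshape(n_samples, sample_len)

# Synthetic stand-in for one vibration record with L = 5000 points.
record = np.sin(2 * np.pi * 50 * np.arange(5000) / 5000.0)
samples = segment_signal(record)
print(samples.shape)  # (4, 1024)
```

With L = 5000, Eq. 1 gives N = 4, and the remaining 904 points are discarded so that no two samples overlap.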

5.1.2 Frequency Domain Input

For the frequency domain input, FFT is used to transform each sample from the time domain into the frequency domain shown in Eq. 2. After this operation, the length of data will be halved and the new sample can be expressed as:

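$$x_i^{\mathrm{FFT}} = \mathrm{FFT}(x_i), \quad i = 1, \ldots, N \tag{2}$$

where $x_i$ is the $i$-th sample and the operator $\mathrm{FFT}(\cdot)$ represents transforming $x_i$ into the frequency domain and taking the first half of the result.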
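The frequency domain input can be illustrated with a short sketch. It assumes that the amplitude spectrum (magnitude divided by the sample length) is used as the feature, which the text above does not state explicitly; the function name `fft_input` is hypothetical.

```python
import numpy as np

def fft_input(sample):
    """Return the first half of the amplitude spectrum of a 1D sample."""
    sample = np.asarray(sample, dtype=float)
    # Using the scaled magnitude is an assumption; the paper only says "FFT".
    spectrum = np.abs(np.fft.fft(sample)) / len(sample)
    return spectrum[: len(sample) // 2]          # length is halved

t = np.arange(1024)
sample = np.sin(2 * np.pi * 100 * t / 1024)      # exactly 100 cycles per window
x_fft = fft_input(sample)
print(x_fft.shape, int(np.argmax(x_fft)))        # (512,) 100
```

The second half of the spectrum of a real signal mirrors the first half, which is why it can be dropped without losing information.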

5.1.3 Time-Frequency Domain Input

For the time-frequency domain input, Short-time Fourier Transform (STFT) is applied to each sample to obtain the time-frequency representation shown in Eq. 3. The Hanning window is used and the window length is set to 64. After this operation, the time-frequency representation (a 33x33 image) will be generated as:

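$$x_i^{\mathrm{STFT}} = \mathrm{STFT}(x_i), \quad i = 1, \ldots, N \tag{3}$$

where $x_i$ is the $i$-th sample and the operator $\mathrm{STFT}(\cdot)$ represents transforming $x_i$ into the time-frequency domain.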
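The 33x33 size can be reproduced with a small NumPy sketch. The hop size of 32 points and the zero-padding of half a window at both ends are assumptions (they are not given in the text); they are simply one setting that turns a 1024-point sample with a 64-point Hanning window into 33 frequency bins by 33 time frames.

```python
import numpy as np

def stft_input(sample, win_len=64, hop=32):
    """Magnitude STFT of a 1D sample, returned as (frequency bins, time frames)."""
    sample = np.asarray(sample, dtype=float)
    # Assumption: pad half a window of zeros on each side, hop = win_len / 2.
    padded = np.pad(sample, win_len // 2, mode="constant")
    window = np.hanning(win_len)
    n_frames = (len(padded) - win_len) // hop + 1
    frames = np.stack([padded[i * hop: i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

image = stft_input(np.random.default_rng(0).normal(size=1024))
print(image.shape)  # (33, 33)
```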

5.1.4 Wavelet Domain Input

For the wavelet domain input, continuous wavelet transform (CWT) is applied to each sample to obtain the wavelet domain representation shown in Eq. 4. Because CWT is time-consuming, the length of each sample is set to 100. After this operation, the wavelet coefficients (a 100x100 image) will be obtained as:

$$x_i^{\mathrm{CWT}} = \mathrm{CWT}(x_i), \quad i = 1, \ldots, N \tag{4}$$

where $x_i$ is the $i$-th sample and the operator $\mathrm{CWT}(\cdot)$ represents transforming $x_i$ into the wavelet domain.

5.1.5 Slicing Image Input

For slicing image input, each sample is reshaped into a 32x32 image. After this operation, the new sample can be denoted as:

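$$x_i^{\mathrm{Reshape}} = \mathrm{Reshape}(x_i), \quad i = 1, \ldots, N \tag{5}$$

where $x_i$ is the $i$-th sample and the operator $\mathrm{Reshape}(\cdot)$ represents reshaping $x_i$ into a 32x32 image.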

However, the above data preprocessing methods cause problems for training AE models and CNN models in two respects: 1) if the input of AE models is a large 2D signal, the decoder has difficulty in the reconstruction procedure and the reconstruction error is very large; 2) if the input of CNN models is a small 2D signal, the CNN is unable to extract appropriate features.

Therefore, we have made a compromise on the data size obtained by the above data preprocessing methods. The sizes of the time domain input and the frequency domain input are unchanged, as shown in Eq. 1 and Eq. 2. For AE models, sizes of all 2D inputs are adjusted to 32x32, while for CNN models, sizes of signals after CWT, STFT, and slicing image are adjusted to 300x300, 330x330, and 320x320, respectively. It should be noted that input sizes of CNN models can be different since we use the AdaptiveMaxPooling layer to adapt to different input sizes.

5.2 Normalization

Input normalization can control values of data to a certain range. It is the basic step in data preparing, which can facilitate the subsequent data processing and accelerate the convergence of DL models. Therefore, we discuss effects of three normalization methods on the performance of DL models.

Maximum-Minimum Normalization: This normalization method can be implemented as

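$$x_i^{\mathrm{normalize}} = \frac{x_i - x_i^{\min}}{x_i^{\max} - x_i^{\min}} \tag{6}$$

where $x_i$ is the input sample, $x_i^{\min}$ is the minimum value in $x_i$, and $x_i^{\max}$ is the maximum value in $x_i$.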

[-1, 1] Normalization: This normalization method can be implemented as

$$x_i^{\mathrm{normalize}} = -1 + 2 \cdot \frac{x_i - x_i^{\min}}{x_i^{\max} - x_i^{\min}} \tag{7}$$

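Z-score Normalization: This normalization method can be implemented as

$$x_i^{\mathrm{normalize}} = \frac{x_i - x_i^{\mathrm{mean}}}{x_i^{\mathrm{std}}} \tag{8}$$

where $x_i^{\mathrm{mean}}$ is the mean value of $x_i$, and $x_i^{\mathrm{std}}$ is the standard deviation of $x_i$.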
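The three normalization methods of Eqs. 6 to 8 can be sketched per sample as follows. The function names are illustrative, and the sketch assumes a non-constant sample (a constant sample would lead to a division by zero, which a practical implementation would need to guard against).

```python
import numpy as np

def minmax_normalize(x):
    """Maximum-minimum normalization: maps the sample to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def symmetric_normalize(x):
    """[-1, 1] normalization: maps the sample to [-1, 1]."""
    return -1.0 + 2.0 * minmax_normalize(x)

def zscore_normalize(x):
    """Z-score normalization: zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

x = np.array([2.0, 4.0, 6.0, 8.0])
print(minmax_normalize(x))      # values from 0 to 1
print(symmetric_normalize(x))   # values from -1 to 1
print(zscore_normalize(x))      # mean 0, standard deviation 1
```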

6 Data Augmentation

Data augmentation, a common step in CV and NLP, can make the training datasets more diverse and alleviate the learning difficulties caused by small sample problems. However, data augmentation for intelligent fault diagnosis has not been investigated in depth. It is also worth mentioning that the key challenge for data augmentation is to create label-preserving samples from existing samples, and this procedure mainly depends on domain knowledge. However, it is difficult to determine whether the generated samples keep the correct labels. So, this paper provides some data augmentation techniques as a reference for other scholars. In addition, these data augmentation strategies are only a simple test and their applications still need to be studied in depth.

6.1 One Dimension Input Augmentation

RandomAddGaussian: this strategy randomly adds Gaussian noise into the input signal formulated as follows:

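$$\tilde{x} = x + n \tag{9}$$

where $x$ is the 1D input signal, $\tilde{x}$ is the augmented signal, and $n$ is generated by the Gaussian distribution $\mathcal{N}(0, \sigma^2)$.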

RandomScale: this strategy multiplies the input signal by a random factor, which is formulated as follows:

$$\tilde{x} = \beta x \tag{10}$$

where $x$ is the 1D input signal, $\tilde{x}$ is the augmented signal, and $\beta$ is a scalar following the distribution $\mathcal{N}(1, \sigma^2)$.

RandomStretch: this strategy resamples the signal by a random ratio and keeps the length unchanged by zero-padding or truncating.

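RandomCrop: this strategy randomly covers part of the signal, which is formulated as follows:

$$\tilde{x} = \mathrm{mask} \odot x \tag{11}$$

where $x$ is the 1D input signal, $\tilde{x}$ is the augmented signal, $\odot$ denotes the element-wise product, and $\mathrm{mask}$ is the binary sequence whose subsequence at a random position is zero. In this paper the length of the subsequence is equal to 10.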
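The four 1D strategies can be sketched in NumPy as below. The noise level `sigma`, the stretch range `max_ratio`, and the linear interpolation used for resampling are assumptions chosen for illustration; only the crop length of 10 comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_add_gaussian(x, sigma=0.01):          # sigma is an assumed value
    return x + rng.normal(0.0, sigma, size=x.shape)

def random_scale(x, sigma=0.01):                 # sigma is an assumed value
    return x * rng.normal(1.0, sigma)            # one scalar factor per signal

def random_stretch(x, max_ratio=0.3):            # stretch range is assumed
    length = len(x)
    new_len = int(length * (1.0 + rng.uniform(-max_ratio, max_ratio)))
    # Resample by linear interpolation, then zero-pad or truncate.
    resampled = np.interp(np.linspace(0, length - 1, new_len),
                          np.arange(length), x)
    out = np.zeros(length)
    n = min(length, new_len)
    out[:n] = resampled[:n]
    return out

def random_crop(x, crop_len=10):
    mask = np.ones(len(x))
    start = rng.integers(0, len(x) - crop_len + 1)
    mask[start:start + crop_len] = 0.0           # zero a random subsequence
    return mask * x

x = np.ones(1024)
print(random_stretch(x).shape, int(np.sum(random_crop(x) == 0)))  # (1024,) 10
```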

6.2 Two Dimension Input Augmentation

RandomScale: this strategy multiplies the input signal by a random factor, which is formulated as follows:

$$\tilde{x} = \beta x \tag{12}$$

where $x$ is the 2D input signal, $\tilde{x}$ is the augmented signal, and $\beta$ is a scalar following the distribution $\mathcal{N}(1, \sigma^2)$.

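RandomCrop: this strategy randomly covers part of the signal, which is formulated as follows:

$$\tilde{x} = \mathrm{mask} \odot x \tag{13}$$

where $x$ is the 2D input signal, $\tilde{x}$ is the augmented signal, $\odot$ denotes the element-wise product, and $\mathrm{mask}$ is the binary sequence whose subsequence at a random position is zero. In this paper the length of the subsequence is equal to 20.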

Because 2D inputs in intelligent fault diagnosis often have clear physical meanings, data augmentation methods from image processing are not suitable for direct transfer to intelligent fault diagnosis.

7 Data Split

One common practice of data split in intelligent fault diagnosis is the random split strategy, and the diagram of this strategy is shown in Fig. 6. In this diagram, we stress that the preprocessing step is performed without overlap, because if the sample preparation process produces overlapping samples, the evaluation of classification algorithms may suffer from test leakage (it is also worth mentioning that if users split the training set and the testing set at the beginning of the preprocessing step, then they can use any processing to deal with the training and testing sets simultaneously, as shown in Fig. 7). In addition, many papers confuse the validation (val) set and the testing set. The formal way is that the training set is further split into the training set and the validation set for model selection. Fig. 6 shows the condition of 4-fold cross validation, and we often use the average accuracy of 4-fold cross validation to represent the generalization accuracy if there is no testing set. In this paper, for testing convenience and time saving, we only use 1-fold validation and use the last epoch accuracy to represent the testing accuracy (we also list the maximum accuracy over all epochs for comparison). It is worth noting that some papers use the maximum accuracy of the validation set, and this strategy is also dangerous because the validation set is then inadvertently used to select the parameters.


Figure 6: Random data splitting strategy with preprocessing without overlap.

Figure 7: Another condition with the training and testing sets split as the first step.

Industrial data from rotating machinery are rarely random and are always sequential (they might contain trends or other temporal correlation). Therefore, it is more appropriate to split data according to time sequences (order split). The diagram of the data split strategy according to time sequences is shown in Fig. 8. From this diagram, it can be observed that we split the training and testing sets by time phase instead of splitting the data randomly. In addition, Fig. 8 also shows the condition of 4-fold cross validation with time. In the following study, we compare the results of this strategy with the random split strategy using the last epoch accuracy and the maximum accuracy over all epochs.


Figure 8: Data split according to time sequences.

8 Evaluation Methodology

8.1 Evaluation Metrics

It is a rather challenging task to evaluate the performance of intelligent fault diagnosis algorithms with suitable evaluation metrics. In intelligent fault diagnosis, three standard evaluation metrics have been widely used: the overall accuracy, the average accuracy, and the confusion matrix. In this paper, we only use the overall accuracy to evaluate the performance of algorithms. The overall accuracy is defined as the number of correctly classified samples divided by the total number of samples. The average accuracy is defined as the mean of the per-class classification accuracies. It should be noted that each class in our datasets has the same number of samples, so the value of the overall accuracy is equivalent to that of the average accuracy.
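The relation between the two accuracies can be checked with a small sketch. The labels below are toy values chosen for illustration and are not results from this study.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Correctly classified samples divided by the total number of samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def average_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Balanced toy example: both metrics coincide.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
print(overall_accuracy(y_true, y_pred), average_accuracy(y_true, y_pred))  # 0.625 0.625

# Imbalanced toy example: the metrics differ.
y_true_imb = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred_imb = [0, 0, 0, 0, 0, 0, 0, 0]
print(overall_accuracy(y_true_imb, y_pred_imb), average_accuracy(y_true_imb, y_pred_imb))  # 0.75 0.5
```

When every class has the same number of samples, the per-class accuracies are weighted equally in both definitions, so the two values agree.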

Since the performance of DL-based intelligent diagnosis algorithms fluctuates during the training process, to obtain reliable results and show the best overall accuracy that the model can achieve, we repeated each experiment five times. Four indicators are used to assess the performance of models, including the mean and maximum values of the overall accuracy obtained by the last epoch (the accuracy in the last epoch can represent the real accuracy without any test leakage), and the mean and maximum values of the maximal overall accuracy (in fact, when we use the maximal accuracy, we also use the testing set to choose the best model). For simplicity, they can be denoted as Last-Mean, Last-Max, Best-Mean, and Best-Max.
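The four indicators can be computed from the per-epoch testing accuracies of the repeated runs, as in the following sketch. The accuracy curves are hypothetical numbers used only to show the computation, not results from this paper.

```python
import numpy as np

def summarize_runs(acc_curves):
    """acc_curves has shape (n_runs, n_epochs): testing accuracy per epoch."""
    acc = np.asarray(acc_curves, dtype=float)
    last = acc[:, -1]           # accuracy of the last epoch in each run
    best = acc.max(axis=1)      # maximal accuracy over all epochs in each run
    return {"Last-Mean": float(last.mean()), "Last-Max": float(last.max()),
            "Best-Mean": float(best.mean()), "Best-Max": float(best.max())}

# Hypothetical curves: 3 runs, 4 epochs each.
curves = [[0.5, 0.7, 0.9, 0.8],
          [0.6, 0.8, 0.7, 0.7],
          [0.4, 0.6, 0.8, 0.9]]
summary = summarize_runs(curves)
print(summary)
```

Best-Mean and Best-Max are never lower than Last-Mean and Last-Max, because picking the best epoch uses the testing set for selection.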

8.2 Experimental Setting

In the preparation stage, we use two strategies, random split and order split, to divide the dataset into training and testing sets. For random split, a sliding window is used to truncate the vibration signal without any overlap, and each data sample contains 1024 points. After the preparation, we randomly take 80% of samples as the training set and 20% of samples as the testing set. For order split, the first 80% of the time series is taken as the time series for the training set, and the last 20% is taken as the time series for the testing set. Then, in each of the two time series, a sliding window is used to truncate the vibration signal without any overlap, and each sample contains 1024 points.
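The two strategies can be contrasted in a short sketch. The function names, the fixed random seed, and the synthetic signal are assumptions for illustration; the window length of 1024 and the 80%/20% ratio follow the text.

```python
import numpy as np

def segment(signal, sample_len=1024):
    """Non-overlapping windows of sample_len points."""
    n = len(signal) // sample_len
    return signal[: n * sample_len].reshape(n, sample_len)

def random_split(signal, train_ratio=0.8, seed=0):
    """Segment first, then assign samples to training/testing at random."""
    samples = segment(signal)
    order = np.random.default_rng(seed).permutation(len(samples))
    n_train = int(train_ratio * len(samples))
    return samples[order[:n_train]], samples[order[n_train:]]

def order_split(signal, train_ratio=0.8):
    """Split the time series first, then segment each part separately."""
    cut = int(train_ratio * len(signal))
    return segment(signal[:cut]), segment(signal[cut:])

signal = np.arange(102400, dtype=float)   # time index as a stand-in signal
train_r, test_r = random_split(signal)
train_o, test_o = order_split(signal)
print(train_r.shape, test_r.shape, train_o.shape, test_o.shape)
```

Under order split, every training point precedes every testing point in time, whereas under random split neighboring windows of the same record can end up on both sides.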

In order to verify how input types, data normalization methods, and data split methods affect the performance of models, we set up three configurations of experiments (shown in Table 8, Table 9, and Table 10) for each dataset. In model training, we use Adam as the optimizer and the softmax cross-entropy as the loss function. The learning rate and the batch size of each experiment are set to 0.001 and 64, respectively. Each model is trained for 100 epochs, and during the training procedure, model training and model testing are alternated. In addition, all the experiments are executed under Windows 10 and PyTorch 1.1 on a computer with an Intel Core i7-9700K, a GeForce RTX 2080Ti, and 16 GB of RAM.

Table 8: Experiment setup 1
Table 9: Experiment setup 2
Table 10: Experiment setup 3

9 Evaluation Results

In this section, we will discuss the experimental results in depth. Complete results are shown in Appendix A (accuracies larger than 95% are shown in bold).

9.1 Results of Datasets

From the results, it can be observed that all datasets except the XJTU-SY dataset have some accuracies exceeding 95%. In addition, the accuracies of CWRU and SEU datasets can reach 100%. The accuracy of XJTU-SY is much lower than others in all conditions, because XJTU-SY is a run-to-failure dataset and we only use the data at the end of the whole process (it may be hard to locate the failure point easily and accurately). Besides, the diagnostic difficulty of the seven datasets can be ranked according to the sum of the best accuracy and the worst accuracy in one certain condition. Results used for sorting come from samples with the random split strategy processed by FFT, the Z-score normalization, and data augmentation. As shown in Fig. 9, we can split the datasets into four levels of difficulty.


Figure 9: The level of dataset difficulty.

9.2 Results of Input Types

In all datasets, the frequency domain input always achieves the highest accuracy, followed by the time-frequency domain input, since in the frequency domain the noise is spread over the full frequency band and the fault information is much easier to distinguish than in the time domain. It is also worth mentioning that, because of the computational load of CWT, we use short samples to perform CWT and then upsample the wavelet coefficients. These steps may degrade the classification accuracies of CWT.

9.3 Results of Models

From the results, it can be observed that CNN models, especially ResNet18, achieve the best accuracy on some datasets, including CWRU, JNU, PU, and SEU. However, for MFPT, UoC, and XJTU-SY, AE models can perform better than other models. This phenomenon may be caused by the size of the datasets and the overfitting problem. Therefore, a more complex model does not yield better results on every dataset.

9.4 Results of Data Normalization

It is hard to conclude which data normalization method is the best one, and from the results, it can be observed that accuracies of different data normalization methods also depend on the used models and datasets. In general, Z-score normalization can make the models achieve the best accuracy.

9.5 Results of Data Augmentation

According to the results, we can conclude that when the accuracies of datasets are already high enough, data augmentation methods may slightly degrade the performance because models have already fitted original datasets well. More augmentation methods may change the distribution of original data and make the learning process harder. However, when the accuracies of datasets are not very high, data augmentation methods improve the performance of models, especially for the time domain input. It should be noted that data augmentation methods designed in this paper may be more suitable for the time domain input. Therefore, researchers can design other various data augmentation methods for their specific inputs.

9.6 Results of Splitting Data

When the datasets are easy to deal with (CWRU and SEU), the results of random split and order split are similar. However, the accuracies of some datasets (PU and UoC) decrease sharply under the order split. What deserves more attention is whether randomly splitting these datasets carries the risk of test leakage. It may be more suitable to split the datasets according to time sequences to verify the performance of designed models.

10 Discussion

Although intelligent diagnosis algorithms can achieve high classification accuracies in many datasets, there are still many issues that need to be discussed. In this paper, we further discuss the following five issues including class imbalance, generalization ability, interpretability, few-shot learning, and model selection.

10.1 Class Imbalance

During operation of the rotating machinery, most of the measured signals are in the normal state, and only a few of them are in the fault state. Fault modes often have different probabilities of happening. Meanwhile, working conditions also have different probabilities of happening. For example, the samples generated by the helicopter hover, cruise, and other flight conditions are naturally unbalanced under the influence of the flight time, and thus the classification of helicopter flight conditions is a typical class imbalance issue. Therefore, the class imbalance issue will occur when using intelligent algorithms in real applications. Recently, although some researchers have published some related papers using traditional imbalanced learning methods (63) or generative adversarial networks (35) to solve this problem, these studies are far from enough. In this paper, PU Bearing Datasets are used to simulate the class imbalance issue. In this experiment, we adopt ResNet18 as the experimental model and only use two kinds of input types (the time domain input and the frequency domain input). Besides, data augmentation methods are used and the normalization method is the Z-score normalization, while the dataset is randomly split. Three groups of datasets with different imbalance ratios are constructed, which are shown in Table 11.

Fault mode | Training samples (Group1) | Training samples (Group2) | Training samples (Group3) | Testing samples (Group1/2/3)
KA04 | 125 | 125 | 125 | 125
KA15 | 125 | 75 | 50 | 125
KA16 | 125 | 75 | 50 | 125
KA22 | 125 | 75 | 50 | 125
KA30 | 125 | 37 | 25 | 125
KB23 | 125 | 37 | 25 | 125
KB24 | 125 | 37 | 25 | 125
KB27 | 125 | 25 | 6 | 125
KI14 | 125 | 25 | 6 | 125
KI16 | 125 | 25 | 6 | 125
KI17 | 125 | 12 | 2 | 125
KI18 | 125 | 12 | 2 | 125
KI21 | 125 | 12 | 2 | 125
Table 11: Number of samples in three groups of imbalanced datasets

As shown in Table 11, three datasets (Group1, Group2, and Group3) are constituted with different imbalance ratios. Group1 is a balanced dataset, and there is no imbalance between states. In real applications, it is almost impossible for all classes to have the same number of samples. We reduce the training samples of some fault modes in Group1 to construct Group2, so that imbalanced classification is simulated. In Group3, the imbalance ratio between fault modes increases further. Group2 can be considered as a moderately imbalanced dataset, while Group3 can be considered as a highly imbalanced dataset.

Experimental results are shown in Fig. 10, and it can be observed that the overall accuracy in Group3 is much lower than that of Group1, which indicates that the class imbalance will greatly degrade the performance of models. To address the problem of class imbalance, data-level methods and classifier-level methods can be used (3). Oversampling and undersampling methods are the most commonly used data-level methods in DL, and some methods for generating samples based on generative adversarial networks (GAN) have also been studied recently. For the classifier-level methods, thresholding-based methods are applied in the test phase to adjust the decision threshold of the classifier. Besides, cost-sensitive learning methods assign different weights to different classes to avoid the suppression of categories with a small number of samples. In the field of fault diagnosis, other methods based on physical meanings and fault attention need to be explored.
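As one illustration of cost-sensitive learning, class weights can be derived from the Group3 training counts in Table 11. The inverse-frequency rule below is a common heuristic chosen here as an assumption; it is not a method evaluated in this paper.

```python
# Training samples per fault mode in Group3 (taken from Table 11).
group3_counts = {
    "KA04": 125, "KA15": 50, "KA16": 50, "KA22": 50,
    "KA30": 25, "KB23": 25, "KB24": 25,
    "KB27": 6, "KI14": 6, "KI16": 6,
    "KI17": 2, "KI18": 2, "KI21": 2,
}

def inverse_frequency_weights(counts):
    """Weight of class c = total / (number of classes * count of c)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}

weights = inverse_frequency_weights(group3_counts)
imbalance_ratio = max(group3_counts.values()) / min(group3_counts.values())
print(imbalance_ratio)                                  # 62.5
print(round(weights["KA04"], 3), round(weights["KI17"], 3))
```

Such weights could be supplied to a weighted cross-entropy loss so that rare fault modes contribute as much to the loss as the frequent ones.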


Figure 10: Experimental results of three groups of datasets. (a) time domain input, and (b) frequency domain input.

10.2 Generalization ability

Many of the existing intelligent algorithms perform very well on one working condition, but the diagnostic performance tends to drop significantly on another working condition, and here, we call it the generalization problem. Recently, many researchers have used algorithms based on transfer learning strategies to solve this problem. To illustrate the weak generalization ability of the intelligent diagnosis algorithms, experiments are also carried out on the PU bearing dataset. Experiments use the data under three working conditions (N15_M07_F10, N09_M07_F10, N15_M01_F10). In these experiments, data under one working condition are used to train models, and data under another working condition are used to test the performance. A total of six groups of experiments are performed, and the detailed information is shown in Table 12.

Group | Data for training | Data for testing
Group1 | N15_M07_F10 | N09_M07_F10
Group2 | N15_M07_F10 | N15_M01_F10
Group3 | N09_M07_F10 | N15_M07_F10
Group4 | N09_M07_F10 | N15_M01_F10
Group5 | N15_M01_F10 | N15_M07_F10
Group6 | N15_M01_F10 | N09_M07_F10
Table 12: Training data and testing data for each experiment
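The six groups in Table 12 are simply all ordered pairs of the three working conditions, which can be enumerated as in the sketch below (the variable names are illustrative).

```python
from itertools import permutations

conditions = ["N15_M07_F10", "N09_M07_F10", "N15_M01_F10"]

# Each ordered pair is (training condition, testing condition).
groups = {f"Group{i}": pair
          for i, pair in enumerate(permutations(conditions, 2), start=1)}

for name, (train_cond, test_cond) in groups.items():
    print(name, "train:", train_cond, "test:", test_cond)
```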

The experimental results are shown in Fig. 11. It can be concluded that in most cases, intelligent diagnosis algorithms trained on one working condition cannot perform well on another working condition, which means the generalization ability of algorithms is insufficient. In general, we expect that our algorithms can adapt to changes in working conditions or measurement situations, since these changes occur frequently in real applications. Therefore, studies still need to be done on how to transfer the trained algorithms to different working conditions effectively.

Two excellent review papers (66; 58) and other applications (14; 15) published recently pointed out several potential research directions which could be considered and studied further to improve the generalization ability.


Figure 11: Experimental results of working conditions transfer. (a) time domain input, and (b) frequency domain input.

10.3 Interpretability

Although intelligent diagnosis algorithms can achieve high diagnostic accuracy in their tasks, the interpretability of these models is often insufficient and these black box models may generate high-risk results (43), which greatly reduces the reliability of results and limits their applications. Actually, some papers in intelligent fault diagnosis have noted this problem and attempted to propose interpretable models (33; 32).

To point out that the intelligent diagnostic algorithm lacks interpretability, we perform three sets of experiments on the PU bearing dataset, and the datasets are shown in Table 13. In each set of experiments, we use two different sets of data, which have the same fault mode and are acquired under the same condition.

Group | Bearing code | Training samples | Testing samples
Group1 | KA03 | 200 | 50
Group1 | KA06 | 200 | 50
Group2 | KA08 | 200 | 50
Group2 | KA09 | 200 | 50
Group3 | KI07 | 200 | 50
Group3 | KI08 | 200 | 50
Table 13: The bearing code and the number of samples used in each experiment

The results are shown in Fig. 12, where intelligent algorithms achieve very high diagnosis accuracies in each set of experiments. Nevertheless, for each binary classification task, since the fault mode and the working condition at the time of acquisition are the same between the two classes, theoretically, methods should not be able to achieve such high accuracy. The experimental results are exactly contrary to this expectation, which shows that models only learn to discriminate different collection points and do not learn how to extract the essential characteristics of fault signals. Therefore, it is very important to figure out whether models can learn essential fault characteristics or just classify the different conditions of collected signals.


Figure 12: Experimental results of three groups of datasets. (a) time domain input, and (b) frequency domain input.

According to the development of interpretability in computer science, we may be able to study the interpretability of DL-based models from the following aspects: (1) visualize the results of neurons to analyze the attention points of models (60); (2) add physical constraints to the loss function (51) to meet specific needs of fault feature extraction; (3) add prior knowledge to network structures and convolutions (41) or unroll the existing optimization algorithms (12) to extract corresponding fault features.

10.4 Few-Shot Learning

The rapid development of deep learning is associated with the big data era. However, in intelligent diagnosis, the amount of data is far from big data because fault data are scarce and fault simulation experiments are costly, especially for key components. To demonstrate the influence of the number of samples on the classification accuracy, we use the PU bearing dataset to design few-shot training patterns with six groups of different numbers of training samples per class.

Results of the time domain input and the frequency domain input are shown in Fig. 13. It is shown that with the decrease of the sample number, the accuracy decreases sharply. As shown in Fig. 13, for the time domain input, the Best-Max accuracy decreases from 91.46% to 20.39% as the sample number decreases from 100 to 1. Meanwhile, the Best-Max accuracy decreases from 97.73% to 29.67% as the sample number decreases from 100 to 1 with the frequency domain input.

Although the accuracy can be increased after using FFT, it is still too low to be accepted when the number of samples is extremely small. It is necessary to develop methods based on few-shot learning to cope with application scenarios with limited samples.
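A few-shot training pattern of this kind can be built by drawing a fixed number of training samples from every class, as in the following sketch. The function name, the random seed, and the toy label vector are assumptions for illustration.

```python
import numpy as np

def few_shot_subset(labels, shots, seed=0):
    """Return sorted indices that keep `shots` random samples per class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    picked = [rng.choice(np.flatnonzero(labels == c), size=shots, replace=False)
              for c in np.unique(labels)]
    return np.sort(np.concatenate(picked))

# Toy label vector: 3 classes with 100 training samples each.
labels = np.repeat(np.arange(3), 100)
idx = few_shot_subset(labels, shots=5)
print(len(idx), np.bincount(labels[idx]))  # 15 [5 5 5]
```

The testing set is left untouched, so the accuracies obtained with different numbers of shots remain comparable.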


Figure 13: Experimental results of different few-shot training patterns. (a) time domain input, and (b) frequency domain input.

Many DL-based few-shot learning models have been proposed in recent years. Most of these methods adopt a meta-learning paradigm by training networks with a large number of tasks, which means that big data in other related fields are necessary for these methods. In the field of fault diagnosis, there is no relevant data of such a size available, so methods embedded with physical mechanisms are required to address this problem effectively.

10.5 Model selection

For intelligent fault diagnosis, designing a neural network is not the final goal; our task is to apply the model to real industrial applications, and designing a neural network is only a small part of this task. However, to achieve good performance, we have to spend considerable time and energy on designing the corresponding networks, because building a neural network is an iterative process of repeated trial and error in which the performance of models is fed back to us to adjust the models. The cost of a single trial multiplied by the number of trials can easily become huge. Reducing this cost is also part of the purpose of this benchmark study, which provides some guidelines for choosing a baseline model.

Actually, there is another way, called neural architecture search (NAS) (9), to avoid the huge cost of trial and error. NAS designs a neural network automatically by searching for a specific network based on a specific dataset. A limited search space of the network is first constructed according to the physical prior. After that, a neural network matching a specific dataset is sampled from the search space through reinforcement learning, the evolutionary algorithm, or the gradient strategy. Besides, the whole construction process does not require manual participation, which greatly reduces the cost of building a neural network and allows us to focus on specific engineering applications.

11 Conclusion

In this paper, we collect most of the publicly available datasets to evaluate the performance of MLP, AE, CNN, and RNN models from several perspectives. Based on the benchmark accuracies, we highlight some evaluation results which are very important for comparing or testing new models. First, not all datasets are suitable for comparing the classification effectiveness of proposed methods, since basic models can already achieve very high accuracies on some of them, like CWRU and SEU. Second, the frequency domain input achieves the highest accuracy in all datasets, so researchers should first try the frequency domain input. Third, CNN models do not necessarily achieve the best results in all cases, and the overfitting problem should also be considered. Fourth, when the accuracies of datasets are not very high, data augmentation methods improve the performance of models, especially for the time domain input. Thus, more effective data augmentation methods need to be investigated. Finally, in some cases, it may be more suitable to split the datasets according to time sequences (order split), since random split may yield artificially high accuracies. Taking these evaluation results into consideration may be helpful when developing new models.

In addition, we release a code library for other researchers to test the performance of their own DL-based intelligent fault diagnosis models on these datasets. Through these works, we hope that the evaluation results and the code library can promote a better understanding of DL-based models, and provide a unified framework for generating more effective models. For further studies, we will focus on the five listed issues (class imbalance, generalization ability, interpretability, few-shot learning, and model selection) to propose more customized works.
