Digitization of industrial assets gives modern maintenance systems access to larger amounts of condition monitoring data at a lower cost. With the increased availability of these data, data-driven fault diagnosis methods have shown great potential in extracting system health information from complex data of varied nature. In recent years, building on the success of data-driven methods, the ambitious goal of transferring fault diagnosis models from one machine to another has attracted great interest. If this problem is solved, industry can save a considerable amount of time and money otherwise spent on manually labeling data and modifying models for new machines in the same fleet. One underlying problem of data-driven methods in these transfer tasks, as in many other application areas, is their strong requirement on the quality of data. A lack of representativeness in the training data can dramatically affect model performance on the target machine: if the target machine operates under a working condition different from those observed in the training data, the model performance may degrade substantially.
Deep neural networks, as one of the popular data-driven methods, suffer especially from this problem. Recent research [1] has shown that deep networks are able to memorize an entire dataset even when random labels are given. This strong capacity for memorization can lead to poor generalization on new machines as well as under new operating conditions. Deep models are thus likely to overfit a training set that may be unrepresentative, and fail to generalize to new data in practice. Specifically, in a fault diagnosis context, the performance of a carefully trained deep model is likely to degrade on a newly deployed machine in the same fleet because of different environmental and operating conditions. Another special case arises when the operating conditions of a machine change over time: deep models are likely to classify new operating conditions as faults simply because they have not observed similar patterns during training.
One intuitive solution to this problem is to add data collected under the new operating conditions to the training set and re-train the model. However, this is usually infeasible because only limited data are available for new operating conditions and new machines, and the fact that these new data are often unlabeled makes the problem even harder.
Domain adaptation methods are designed to tackle this kind of dilemma where two different machines are involved. They aim to leverage a small amount of unlabeled data under new operating conditions to improve the model’s generalization ability. Because these methods aim at transferring results achieved on a first domain, with labeled data under given operating conditions, to a second domain, with unlabeled data and different operating conditions, the approach is referred to as "domain adaptation". It has been a widely discussed topic in fields such as computer vision and natural language understanding [2, 3, 4, 5, 6]. Inspired by the successful implementation of Domain-Adversarial Neural Networks (DANN) [5], we propose to make use of its ability to alleviate domain difference for fault diagnosis problems.
Over the last few years, several fault diagnosis papers [7, 8, 9] also proposed to apply other domain adaptation methods to improve model performance on new operating conditions. These recent attempts raise two natural questions: Are domain adaptation methods applicable in realistic fault diagnosis settings? How well do they perform compared with each other? In this paper, we argue that previous papers have not answered these questions sufficiently. A fair evaluation across different methods requires a careful choice of network structures, data preprocessing, training strategy, etc. The aim of this paper is to answer these questions by using a unified experimental protocol on a popular dataset, the Case Western Reserve University (CWRU) dataset for rolling element bearings in rotating machinery. We believe the proposed protocol shows the future potential of domain adaptation methods in fault diagnosis.
II Related Work
Deep learning methods [10, 11, 12, 13] have attracted a large amount of attention by promising better performance without the need for hand-crafted features. However, it is known that when a trained model is deployed under unseen operating conditions, its performance can deteriorate dramatically because of the difference in operating conditions, in other words the difference in data distribution, between training and testing machines.
In previous works [14, 15], this difference is often called domain shift, where training data under observed operating conditions are considered the source domain, and newly collected data under new operating conditions are considered the target domain. The domain shift problem has been widely discussed in other fields such as computer vision [16, 17]. To alleviate the effect of domain shift in the input space, one idea is to align the distributions in an intermediate feature space; this intuition leads to a series of domain adaptation methods [2, 3, 4, 5, 6]. For example, [2] proposes to learn transfer components across domains. The Deep Adaptation Network (DAN) method [3] proposes to reduce domain discrepancy by minimizing the Maximum Mean Discrepancy (MMD) between source and target layers. Driven by a similar motivation, Adaptive Batch Normalization (AdaBN) [4] aligns the distributions through a modified batch normalization layer and calculates batch normalization statistics separately for source and target data. Following the success of adversarial training on other tasks, DANN [5] proposes to align the distributions by adopting a domain discriminator and training the model adversarially. Recently, [6] proposes to align the distributions of source and target by utilizing the task-specific decision boundaries and maximizing classifier discrepancy.
For fault diagnosis applications, existing papers usually focus on the case where unlabeled data in the target domain are fully provided, and directly apply the above domain adaptation techniques to solve the problem. For example, [7] proposes to use AdaBN to learn a model with good anti-noise and domain adaptation ability on raw vibration signals. Similarly, [8] proposes to align the distributions of intermediate layers between source and target feature extractors by adversarial training. [18] considers the problem of fault detection within a fleet using unsupervised feature alignment. Recently, [9] uses MMD-minimization to align the full source and target distributions for rotating machines.
III Problem Description
The main motivation behind domain adaptation in fault diagnosis is that, in industry, it is not uncommon to see a fleet of similar machines with similar purposes. It would be beneficial to manually label the data from one single machine and later transfer model knowledge from this well-studied machine to other newly deployed machines in the same fleet, given that machines in the same fleet share characteristics and features. However, the fact that these machines may be operated under different conditions, and not even necessarily by a single operator, makes the transfer hard in practice. This change of operating condition can be described as a distribution difference between training and testing data.
Besides learning from labeled data from the source machine, domain adaptation aims to leverage the limited data available from the target machine to improve performance on that machine. In the ideal scenario, taking these partial target data into consideration should help the model perform better on the target machine.
To evaluate the effectiveness of different domain adaptation methods in fault diagnosis applications, we adopt the following setup, based on how the fault diagnosis transfer problem with two machines has been formulated in most previous papers. The first machine, denoted by source, has been operating for a long time, which has made possible the collection of representative data on different faults. The second machine, denoted by target, has less data available, and these data are unlabeled. The source and target machines share similar characteristics but operate under different operating conditions. We further assume that these two machines share the same set of fault types. The goal of the training is to improve the performance of the model on the target machine.
III-A Domain Adaptation Task
Formally, we consider our first domain adaptation task for fault diagnosis. Given:
Labeled training data from source machine $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$
Unlabeled data from target machine $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$
where $y_i^s \in \mathcal{Y}$ are condition classes to predict, i.e. healthy state and various faulty states, and $\mathcal{Y} = \{1, \dots, K\}$ is the union of all possible classes. Labels of target data are unavailable during training. The target of the task is to train a model using labeled $\mathcal{D}_s$ and unlabeled $\mathcal{D}_t$, and improve its performance on $\mathcal{D}_t$. We denote the ground truth labels of the target data as $\{y_j^t\}_{j=1}^{n_t}$.
In this setup, we assume that the unlabeled data from the target machine already cover most of the fault types, thus the label space $\mathcal{Y}$ is the same between $\mathcal{D}_s$ and $\mathcal{D}_t$. This is the setup used by most previous domain adaptation papers in fault diagnosis.
We propose to evaluate several popular domain adaptation methods under a unified experimental protocol. In this section, we first introduce the shared backbone architecture we used in all our experiments. Then we introduce the domain adaptation methods that we are going to compare.
IV-A Baseline Architecture
One main obstacle in comparing different domain adaptation methods in fault diagnosis is that different works use different architectures for their experiments, so a direct comparison of results is unfair due to the different capacities of the networks. In this paper, we evaluate all domain adaptation methods using the same basic architecture to ensure a fair comparison.
The feature extractor takes input data and outputs a feature representation of the given data. It includes three 1-D convolutional layers. Each has a filter length of 3 and a hidden size of 10, and is followed by a sigmoid activation function and a dropout layer with a dropout rate of 0.5. The representation is then flattened and passed through a fully-connected layer that maps it to a predefined feature size. Following the original paper, a feature size of 256 is used.
We choose the architecture in Fig 2 because it constitutes a rather strong baseline for domain adaptation tasks. The effectiveness of the architecture is demonstrated in , and also validated by our reproduced results.
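We use a two-layer classifier after extracting the feature representation of the input data. The first layer is a 256-unit fully-connected layer with ReLU activation and dropout. The second fully-connected layer then maps the signal into scores for each class. Finally, softmax cross-entropy loss is used for all our experiments. The classification loss shared by all our experiments is thus:
$$L_c = -\frac{1}{n_s} \sum_{i=1}^{n_s} \sum_{k=1}^{K} \mathbb{1}[y_i^s = k] \, \log p_k(x_i^s)$$
where $p(x_i^s)$ is the softmax output of the basic backbone for the labeled source sample $x_i^s$ with label $y_i^s$, $n_s$ is the number of labeled source samples, and $K$ is the number of classes.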
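As a minimal sketch of the softmax cross-entropy loss described above, the following NumPy snippet computes the mean negative log-probability of the true class. The function names `softmax` and `cross_entropy` and the toy logits are illustrative assumptions, not part of the original implementation, which was written in TensorFlow.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-probability assigned to the true class.
    probs = softmax(logits)
    rows = np.arange(logits.shape[0])
    return float(-np.mean(np.log(probs[rows, labels] + 1e-12)))

# Toy example (hypothetical values): 2 samples, 3 classes.
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
loss = cross_entropy(logits, labels)
print(f"mean cross-entropy: {loss:.4f}")
```

With uniform scores over the 10 bearing conditions, this loss equals log(10), which gives a useful reference point when monitoring training.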
IV-B Domain Adaptation Methods
We now introduce the three domain adaptation methods to be compared. The methods are chosen based on their applicability to deep models. Classic methods such as Transfer Component Analysis (TCA) [2] were not considered because of their inferior performance, as shown by the experiments in . To our knowledge, this is the first time the DANN method is introduced in a fault diagnosis context.
IV-B1 Domain-Adversarial Neural Networks (DANN [5])
Since the operating conditions of source and target machines are different, if the model is trained naively, it is easy to distinguish target machine features from source machine features. The main idea of adversarial distribution alignment methods is to tackle this problem by making the feature extractor unbiased with respect to features from source and target machines. This is achieved by an idea closely related to GAN [19]. By adding a discriminator and introducing adversarial training, DANN [5] aligns the source and target feature distributions and makes them hard to distinguish.
During training, we aim to reduce the $\mathcal{H}$-divergence between source and target feature distributions. Fortunately, the adversarial alignment method proposed in [5] for domain adaptation can effectively reduce the $\mathcal{H}$-divergence by reversing gradients and changing the representation space. We modify their method for our semi-supervised learning scenario.
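The neural network includes three components: a feature extractor $G_f$ with parameters $\theta_f$, a label predictor $G_y$ with parameters $\theta_y$, and a discriminator $G_d$ with parameters $\theta_d$. The divergence reduction is achieved by introducing the discriminator to tell whether the features come from source or target data while asking the feature extractor to fool the discriminator. During the learning stage, on one side, we are trying to achieve the traditional training objective of minimizing the label prediction error. At the same time, we are also pushing the features to be invariant to their origin, i.e. reducing the divergence between the source and target feature distributions. This is monitored by the discriminator, where a successful alignment should yield a high domain prediction loss. Formally, this is equivalent to the following min-max problem:
$$E(\theta_f, \theta_y, \theta_d) = L_c(\theta_f, \theta_y) - \lambda \, L_d(\theta_f, \theta_d)$$
$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)$$
where $L_c$ is the label prediction loss on the labeled source data, $L_d$ is the domain prediction loss on both source and target data, and $\lambda$ weights the adversarial term.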
The loss function is divided into two parts, label prediction loss and domain prediction loss. The first term is the usual supervised loss for labeled data, and is intended to train the feature extractor and label predictor. The second term is an adversarial loss that encourages the features to be domain-invariant and thus aligns the two distributions.
In the argmin step, we minimize the label prediction loss and maximize the domain prediction loss to obtain domain-invariant features. In the maximization step, we minimize the domain prediction loss, and thus train the domain predictor to provide a precise prediction of the origin of the features. This min-max problem is solved by adding a gradient reverse layer between the feature layer and the discriminator, as described in [5].
In all our experiments, we use a three-layer fully-connected classifier as our discriminator. The first two layers have a hidden size of 1024 with ReLU activation, while the last layer maps the signal into 2 classes: source and target. Cross-entropy loss is used for the discriminator loss.
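By using the gradient reverse layer and the above setup, the loss function can be reformulated into:
$$L_{\mathrm{DANN}} = L_c\big(G_y(G_f(x^s)), y^s\big) + L_d\big(G_d(R_\lambda(G_f(x))), d\big)$$
where $G_f$, $G_y$, and $G_d$ are the feature extractor, label predictor, and discriminator, $x^s$ and $y^s$ are labeled source samples and their labels, $x$ ranges over both source and target samples with domain label $d$, and $R_\lambda$ is the gradient reverse layer, which acts as the identity in the forward pass and multiplies the gradient by $-\lambda$ in the backward pass.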
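The behavior of a gradient reverse layer can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the TensorFlow implementation used in the experiments: the discriminator is reduced to a single logistic unit with hypothetical fixed weights `w` and `b`, and the features are updated directly as a stand-in for updating the feature extractor's parameters.

```python
import numpy as np

def grl_forward(features):
    # Forward pass: the layer is the identity.
    return features

def grl_backward(grad_output, lam):
    # Backward pass: flip the sign of the gradient and scale it by lam.
    return -lam * grad_output

def domain_loss_and_grad(features, domain_labels, w, b):
    # Logistic discriminator: binary cross-entropy and its gradient
    # with respect to the input features.
    z = features @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    eps = 1e-12
    loss = -np.mean(domain_labels * np.log(p + eps)
                    + (1.0 - domain_labels) * np.log(1.0 - p + eps))
    grad_z = (p - domain_labels) / len(domain_labels)
    return float(loss), np.outer(grad_z, w)

rng = np.random.default_rng(0)
features = np.vstack([rng.normal(1.0, 0.5, (8, 2)),    # toy source features
                      rng.normal(-1.0, 0.5, (8, 2))])  # toy target features
domain_labels = np.concatenate([np.zeros(8), np.ones(8)])  # 0 = source, 1 = target
w, b = np.array([-1.0, -1.0]), 0.0  # hypothetical discriminator that separates the domains

loss_before, grad = domain_loss_and_grad(grl_forward(features), domain_labels, w, b)
reversed_grad = grl_backward(grad, lam=1.0)
# A plain gradient-descent step on the reversed gradient moves the features
# in the direction that makes the discriminator's task harder.
features_after = features - 0.5 * reversed_grad
loss_after, _ = domain_loss_and_grad(features_after, domain_labels, w, b)
print(loss_before, loss_after)
```

Because the sign is flipped only on the path into the feature extractor, the discriminator itself still descends on the domain loss, so both players of the min-max game are trained with one standard backward pass.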
IV-B2 Maximum Mean Discrepancy (MMD) Minimization
Similar to DANN, MMD-minimization offers an alternative way to measure the discrepancy between source and target distributions. Unlike DANN, which estimates the $\mathcal{H}$-divergence between distributions, MMD is defined as the squared distance between the kernel embeddings of the marginal distributions in the Reproducing Kernel Hilbert Space (RKHS). Formally,
$$\mathrm{MMD}_k^2(p, q) = \left\| \mathbb{E}_{x^s \sim p}[\phi(x^s)] - \mathbb{E}_{x^t \sim q}[\phi(x^t)] \right\|_{\mathcal{H}_k}^2$$
where $\mathcal{H}_k$ denotes the RKHS with a kernel $k$ and feature map $\phi$, and $p$ and $q$ are labeled source and unlabeled target distributions.
In practice, the choice of kernel used in obtaining these embeddings is crucial to a successful estimation of the discrepancy. Multiple kernels are usually combined to leverage different kernels and provide an effective estimation:
$$k(x^s, x^t) = \sum_{u=1}^{m} k_{\sigma_u}(x^s, x^t)$$
where $k_{\sigma_u}$ is a Gaussian kernel with width $\sigma_u$. Following the settings in previous MMD works in fault diagnosis [9], we adopt Gaussian kernel widths of 1, 2, 4, 8, and 16. Previous works have shown that this choice of kernels with equal weights is sufficient for our specific task.
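The multi-kernel MMD loss is then used as an additional loss along with the label prediction loss to align the feature distributions of the source and target machines:
$$L_{\mathrm{MMD}} = L_c + \alpha \, \mathrm{MMD}_k^2(p, q)$$
where $L_c$ is the label prediction loss, $\mathrm{MMD}_k^2(p, q)$ is the multi-kernel MMD between the source and target feature distributions $p$ and $q$, and $\alpha$ is the MMD discrepancy weight.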
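A multi-kernel MMD estimate of this kind can be sketched as follows. The snippet is an illustration under assumptions rather than the authors' code: it uses the simple biased estimator over all sample pairs, parameterizes the Gaussian kernel as exp(-d²/(2σ²)), and sums the five kernels with equal weights; the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(a, b, widths):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative values
    # Equal-weight sum of Gaussian kernels (kernel form is an assumption).
    return sum(np.exp(-sq_dist / (2.0 * w**2)) for w in widths)

def mk_mmd2(source, target, widths=(1.0, 2.0, 4.0, 8.0, 16.0)):
    # Biased estimate of the squared MMD between two sets of feature vectors.
    k_ss = gaussian_kernel_matrix(source, source, widths)
    k_tt = gaussian_kernel_matrix(target, target, widths)
    k_st = gaussian_kernel_matrix(source, target, widths)
    return float(k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean())

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, (64, 8))      # toy source features
tgt_same = rng.normal(0.0, 1.0, (64, 8)) # same distribution as source
tgt_shift = rng.normal(3.0, 1.0, (64, 8))  # shifted distribution
print(mk_mmd2(src, tgt_same), mk_mmd2(src, tgt_shift))
```

The estimate is close to zero when both batches come from the same distribution and grows with the shift, which is what makes it usable as an alignment penalty on the feature layer.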
IV-B3 Adaptive Batch Normalization (AdaBN) [4]
Before introducing AdaBN, we briefly review Batch Normalization (BN). BN layers are designed to alleviate internal covariate shift by keeping the input distribution of each layer unchanged across different mini-batches. Consider an intermediate representation $X \in \mathbb{R}^{n \times p}$, where $n$ is the batch size and $p$ is the dimension of features. The BN layer transforms a feature $j \in \{1, \dots, p\}$ by:
$$\hat{x}_j = \frac{x_j - \mathbb{E}[X_{\cdot j}]}{\sqrt{\mathrm{Var}[X_{\cdot j}]}}, \qquad y_j = \gamma_j \hat{x}_j + \beta_j$$
where $x_j$ and $y_j$ are the input and output of the BN layer for feature $j$, $X_{\cdot j}$ is the $j$-th column of $X$, and $\gamma_j$ and $\beta_j$ are parameters to be learned in the training process. The mean and variance statistics are calculated over the mini-batch during training, but over the whole population at test time.
AdaBN is based on the simple assumption that the deterioration of models on the target machine is caused by a distribution discrepancy on intermediate layers. By adding batch normalization layers and replacing BN statistics from source data with those from target data, the distribution difference is expected to be reduced in each layer, thus, increasing the model’s performance on the target data. Apart from a small amount of Batch Normalization parameters, AdaBN requires no additional parameters and is easy to implement.
In our AdaBN experiments, the Batch Normalization layers are inserted after each convolutional layer in the feature extractor. After training, we fix the BN scale and shift parameters $\gamma$ and $\beta$ and all other trainable variables, and finetune the batch normalization statistics, i.e. the mean and variance of each feature, using the target data.
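The effect of swapping the batch normalization statistics can be illustrated with a small NumPy sketch. It is a simplified, single-layer example with synthetic features and hypothetical names, not the network used in the experiments: the learned scale and shift stay fixed, and only the mean and variance are re-estimated on target data.

```python
import numpy as np

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    # Inference-time batch normalization with externally supplied statistics.
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, (500, 4))  # toy source activations
target = rng.normal(2.0, 3.0, (500, 4))  # toy target activations with shifted statistics
gamma, beta = np.ones(4), np.zeros(4)    # learned parameters, kept fixed

# Statistics accumulated on source data during training.
src_mean, src_var = source.mean(axis=0), source.var(axis=0)
# AdaBN step: re-estimate the statistics on unlabeled target data.
tgt_mean, tgt_var = target.mean(axis=0), target.var(axis=0)

out_source_stats = batch_norm(target, src_mean, src_var, gamma, beta)
out_adabn = batch_norm(target, tgt_mean, tgt_var, gamma, beta)

print("target mean with source statistics:", out_source_stats.mean(axis=0).round(2))
print("target mean with target statistics:", out_adabn.mean(axis=0).round(2))
```

With target statistics, the layer output for target data is standardized again, so the following layers see inputs in the range they were trained on.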
V Case Study
We now present a case study on the CWRU bearing dataset using the above methods. The case study is designed so that the comparison between methods is applicable to realistic fault diagnosis settings. More specifically, we consider the following factors:
The basic backbone architectures are the same across different experiments, so that the capacity of the models does not affect the results.
All experiments share the same pre-processing steps to exclude the effect of the number of samples and augmentation methods.
The different models share a similar budget for hyper-parameter tuning.
A realistically chosen validation set is used for hyper-parameter tuning.
The CWRU bearing dataset  from Bearing Data Center of Case Western Reserve University is used in our experiments. The dataset is chosen because of its availability to the public and its popularity over a large number of previous papers, including studies in domain adaptation. Following the general setup used by most other bearing diagnosis papers, drive end accelerometer data are used in all our experiments.
Following the label definition setup used by , 10 bearing conditions are considered as shown in Table I. Three fault types are included: inner race fault (IF), ball fault (BF), and outer race fault (OF). Faults were introduced to the bearings using electro-discharge machining with fault diameters of 7 mils, 14 mils, and 21 mils. In total, there are 9 fault states and one healthy state. The dataset was originally collected at 12 and 48 kHz. In all our experiments, we make use of data at a 12 kHz sampling rate. If the data are not available at 12 kHz, we down-sample them to ensure a consistent 12 kHz sampling rate over all data points.
The CWRU dataset comprises data from four different loads, which we treat as four different working conditions . The domain adaptation is applied across different loads. In this section we denote Task as the setup where source domain is the working load 0 and target domain is the working load 1.
We mostly follow the same preprocessing steps as . Each signal is first truncated to its first 120,000 points, which are divided into 200 sequences of 1024 points with some overlap between sequences. Using the Fast Fourier Transform, each sequence is converted into a vector of 512 Fourier coefficients. All data are then normalized by a simple normalization factor. The normalization factor is chosen between. The normalization factor is the same for all experiments. It is determined by choosing the one that maximizes the performance of the source-only baseline on the validation task.
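The windowing and FFT steps can be sketched as follows. Several details are assumptions made for illustration only: the windows are spaced evenly so that 200 of them span the truncated signal, the 512 coefficients are taken as the magnitudes of the first 512 one-sided FFT bins, and the normalization step is omitted because its factor is not reproduced here. The function name `preprocess` is hypothetical.

```python
import numpy as np

def preprocess(signal, n_points=120_000, n_seq=200, seq_len=1024, n_coeff=512):
    # Keep only the first n_points samples of the raw vibration signal.
    signal = np.asarray(signal, dtype=float)[:n_points]
    # Evenly spaced window starts so that n_seq overlapping windows fit (assumption).
    stride = (len(signal) - seq_len) // (n_seq - 1)
    starts = np.arange(n_seq) * stride
    windows = np.stack([signal[s:s + seq_len] for s in starts])
    # One-sided FFT magnitudes; keep the first n_coeff bins (assumption).
    return np.abs(np.fft.rfft(windows, axis=1))[:, :n_coeff]

# Synthetic stand-in for a vibration signal: a pure tone at FFT bin 100.
t = np.arange(121_000)
toy_signal = np.sin(2 * np.pi * 100 * t / 1024)
spectra = preprocess(toy_signal)
print(spectra.shape)
```

With these assumed settings the stride is 597 samples, so consecutive windows overlap by 427 points.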
To fairly evaluate domain adaptation methods for fault diagnosis applications, a strong baseline is critical. In our case study, we use the feature extractor along with a basic classifier as our baseline, as shown in Fig 2, and train it using only source data. No additional target data are used in the baseline. To choose the hyper-parameters, we use the task as validation task to tune all models, because it is one of the most difficult tasks among all the transfer pairs. We use the Adam optimizer with a learning rate of 0.0002. The general hyper-parameters are fixed and shared by all other experiments once the baseline model is optimized according to the validation task.
V-D Budgets for Method-specific Hyper-parameters
To fairly compare the different methods, equivalent budgets for hyper-parameters should be used for all models. For DANN models, we use as the pool of hyper-parameters for gradient reverse factor . For MMD models, we use , for the MMD discrepancy weight . AdaBN does not require any additional hyper-parameter. We train all models for 2000 epochs.
V-E Experimental Environment
An NVIDIA GTX 1080 GPU is used for all experiments. The main framework is written in Python and TensorFlow. We run all experiments five times and report average and maximum accuracy to reflect the model performance and stability.
V-F Experiment Results
In the following the experimental results of different domain adaptation methods are reported.
V-F1 Model Performance
By carefully tuning the basic backbone using the validation task, we report the average accuracy of 94.99% for CWRU dataset. We argue that this is a rather strong baseline, as it is stronger than that reported in previous papers [15, 8, 14], and close to some of the results reported in studies applying domain adaptation . By providing a strong baseline, our evaluation reflects more fairly the effectiveness and applicability of the discussed domain adaptation methods.
Under the assumption that unlabeled data are available on the target domain, all domain adaptation methods discussed in this paper are able to improve model performance. DANN yields very good results, achieving an average accuracy of over 99.0% on the target domain, which suggests a meaningful feature alignment and a successful adaptation. Similarly, the MMD approach improves the model performance on target data and achieves an average accuracy of 99.4% over all tasks. The only drawback of the MMD method may arise when the data size is larger, as the training time can increase quadratically. AdaBN, as a simple method without any additional parameters, also improves the model performance, though not as significantly as the other methods. The advantage of AdaBN is that it can easily be combined with other domain adaptation methods without increasing the model complexity.
On the right side of Table II, we show results from previous works using similar approaches for comparison. These results are not directly comparable with the other columns because each paper uses its own way of preparing the target test set. MMD-ML [9] uses an MMD setup similar to ours, except that it applies the MMD loss not only on the feature layer, but also on other intermediate layers. A2CNN [8] uses adversarial training for domain adaptation, and shares a similar idea with DANN. The key difference between A2CNN and DANN is that the A2CNN implementation does not use a reverse gradient layer, but utilizes a two-step training for classifiers and discriminators. This requires a more careful tuning of the training strategy.
The missing cells in Table II mean that the original papers do not report results on these tasks. In this paper, we report model performance on all available adaptation tasks on the CWRU dataset. We believe that doing so provides a better and fairer comparison among domain adaptation methods for fault diagnosis.
V-F2 Model Efficiency
Model efficiency is crucial in practice for fault diagnosis applications, as computational resources may be limited. In Table II, for each method, we report the training time for 2000 epochs and the model complexity in terms of trainable parameters. We believe that this brings more insight into the characteristics of these methods. As explained in the method section, AdaBN is the fastest of the three domain adaptation methods we introduced in this paper. A small number of extra parameters is introduced by the batch normalization layers in AdaBN training, which requires a comparably small amount of additional time. MMD methods, on the other hand, require no additional trainable parameters, but need significantly more training time. The additional time results from the time-consuming procedure of MMD estimation in every training iteration. One additional problem of MMD-related methods is their quadratic time complexity with regard to sample size. This limits their application in a more general scenario, where more training points are available. The DANN method requires more parameters because of the additional domain classifier. Its training time, however, is significantly smaller than that of the MMD methods. By using gradient reverse layers, the adversarial training procedure fits into the standard gradient descent training of neural networks, and thus the $\mathcal{H}$-divergence can be estimated efficiently.
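As a rough illustration of the quadratic cost mentioned above, assume an estimator that evaluates every pairwise kernel value in a mini-batch of $n_s$ source and $n_t$ target feature vectors with $m$ kernels. These symbols and the choice of estimator are assumptions introduced here for illustration. The number of kernel evaluations per training iteration is then:

```latex
\underbrace{m\,n_s^2}_{\text{source-source}}
+ \underbrace{m\,n_t^2}_{\text{target-target}}
+ \underbrace{m\,n_s n_t}_{\text{source-target}}
\;=\; 3\,m\,n^2 \quad \text{when } n_s = n_t = n
```

Under this assumption, doubling the batch size roughly quadruples the number of kernel evaluations, whereas the cost of a discriminator forward and backward pass grows only linearly with the batch size.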
DANN, MMD, and AdaBN are all able to improve the model performance on the target task. AdaBN requires few additional parameters and provides moderate adaptation ability at minimal additional computational cost. MMD, on the other hand, yields the best results for the bearing dataset at the largest computational cost. It also has a potential problem in dealing efficiently with larger training sets. The DANN method we introduced to fault diagnosis from [5] can be considered a good trade-off between accuracy and computational cost. It provides competitive results at a reasonable additional computational cost.
In the present paper, we proposed to use DANN, an adversarial domain adaptation method, for supervised fault diagnosis tasks. We compared its performance with two other domain adaptation methods. To enable a fair comparison between the methods and to evaluate their applicability and effectiveness for fault diagnosis problems in practice, we proposed a unified experimental procedure. All of the methods applied in this case study were able to improve model performance on target data, suggesting that these domain adaptation methods provide added value to fault diagnosis problems in real applications. The DANN method provides competitive results using significantly less training time than MMD, and yields superior results to AdaBN.
- [1] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
- [2] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.
- [3] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferable features with deep adaptation networks,” arXiv preprint arXiv:1502.02791, 2015.
- [4] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, “Revisiting batch normalization for practical domain adaptation,” arXiv preprint arXiv:1603.04779, 2016.
- [5] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014.
- [6] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, “Maximum classifier discrepancy for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3723–3732.
- [7] W. Zhang, G. Peng, C. Li, Y. Chen, and Z. Zhang, “A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals,” Sensors, vol. 17, no. 2, p. 425, 2017.
- [8] B. Zhang, W. Li, M. Zhang, and Z. Tong, “Adversarial adaptive 1-d convolutional neural networks for bearing fault diagnosis under varying working condition,” arXiv preprint arXiv:1805.00778, 2018.
- [9] X. Li, W. Zhang, Q. Ding, and J.-Q. Sun, “Multi-layer domain adaptation method for rolling bearing fault diagnosis,” Signal Processing, vol. 157, pp. 180–197, 2019.
- [10] C. Li, R.-V. Sanchez, G. Zurita, M. Cerrada, D. Cabrera, and R. E. Vásquez, “Multimodal deep support vector classification with homologous features and its application to gearbox fault diagnosis,” Neurocomputing, vol. 168, pp. 119–127, 2015.
- [11] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang, “Deep structured energy based models for anomaly detection,” arXiv preprint arXiv:1605.07717, 2016.
- [12] F. Jia, Y. Lei, J. Lin, X. Zhou, and N. Lu, “Deep neural networks: A promising tool for fault characteristic mining and intelligent diagnosis of rotating machinery with massive data,” Mechanical Systems and Signal Processing, vol. 72, pp. 303–315, 2016.
- [13] P. Tamilselvan and P. Wang, “Failure diagnosis using deep belief learning based health state classification,” Reliability Engineering & System Safety, vol. 115, pp. 124–135, 2013.
- [14] X. Li, W. Zhang, and Q. Ding, “Cross-domain fault diagnosis of rolling element bearings using deep generative neural networks,” IEEE Transactions on Industrial Electronics, 2018.
- [15] W. Zhang, C. Li, G. Peng, Y. Chen, and Z. Zhang, “A deep convolutional neural network with new training methods for bearing fault diagnosis under noisy environment and different working load,” Mechanical Systems and Signal Processing, vol. 100, pp. 439–453, 2018.
- [16] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in European conference on computer vision. Springer, 2010, pp. 213–226.
- [17] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7167–7176.
- [18] G. Michau and O. Fink, “Unsupervised Fault Detection in Varying Operating Conditions,” in Proceedings of the 2019 IEEE International Conference on Prognostics and Health Management, 2019.
- [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
- [20] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, no. 1-2, pp. 151–175, 2010.
- [21] C. Cortes and M. Mohri, “Domain adaptation in regression,” in International Conference on Algorithmic Learning Theory. Springer, 2011, pp. 308–323.
- [22] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
- [23] W. A. Smith and R. B. Randall, “Rolling element bearing diagnostics using the case western reserve university data: A benchmark study,” Mechanical Systems and Signal Processing, vol. 64, pp. 100–131, 2015.