I Introduction
Rolling bearings, among the most important mechanical components of rotary machines, are prone to faults for various reasons, such as harsh working environments and long working periods. A faulty bearing may harm mechanical equipment and lead to catastrophic accidents. Accordingly, accurate and effective diagnosis of bearing faults is essential for the reliable operation of equipment [wang2019new].
With the advancement of technology, many approaches to intelligent bearing fault diagnosis have been developed. Machine learning is one of them: features extracted from preprocessed data are fed into classifiers such as Support Vector Machine (SVM) [wang2021modified], k-Nearest Neighbors (KNN) [lu2021enhanced], and Random Forest (RF) [wei2021intelligent] to classify fault types. Accurate prediction presupposes that the extracted features are sufficiently informative [r1, r2]. These methods still have drawbacks: they demand particular technical expertise for feature extraction and feature selection, and they have a limited capacity to learn nonlinearity and complexity in data patterns.
[lei2020applications]. Deep learning (DL) techniques have been employed in recent years to address the shortcomings of traditional machine learning methods. DL offers more dependable performance in feature learning and extracts more abstract features. Furthermore, end-to-end learning algorithms require no prior knowledge or experience for feature engineering [r3, r4]. Convolutional neural networks (CNN) [liu2020multitask], deep autoencoders (DAE) [liu2020stacked], deep belief networks (DBN) [xing2020distribution], and recurrent neural networks (RNN) [ravikumar2021gearbox] are the most commonly utilized DL techniques in intelligent fault diagnosis. A novel discriminant regularizer in DAE has been proposed in [mao2021new] to diagnose bearing faults. A deep residual CNN is used in [yang2020fault], in which the noise impact is decreased by employing a wide kernel in the first convolutional layer; an adaptive batch normalization method is also applied to reduce the gap produced by the distribution discrepancy between source and target data. Peng et al. [peng2020multibranch] proposed a multiscale CNN that extracts short-time and long-time features and fuses them to enhance fault diagnosis performance. Wang et al. [wang2021intelligent] introduced a normalized CNN to identify faults, with the model's hyperparameters optimized using particle swarm optimization.
If a substantial amount of labeled data is available for model training, and the training and test data share the same distribution, DL models can perform accurately. In contrast, collecting labeled data, particularly faulty data, is impracticable in many real-world applications due to time and cost constraints. Moreover, the training and test data may have different probability distributions due to the continuously changing operating conditions of rotary machines, resulting in poor performance and limited generalization of DL models [me]. Hence, it is essential to develop a strategy that alleviates the gap between the training and test data distributions and enhances performance without creating a new model for new unlabeled data. Over the past few years, the unsupervised domain adaptation (UDA) technique has been utilized as a distinctive form of transfer learning in intelligent fault diagnosis, in which knowledge learned from labeled data in the source domain is transferred to the unlabeled data of the target domain by discovering domain-invariant and discriminative characteristics [xu2021ifds]. UDA approaches train a model with shared weights on both the source and target domains, benefiting from both, and aim to learn characteristics that are invariant across the two domains. CORAL [c] is one of the most effective UDA methods and has been widely utilized in unsupervised bearing fault diagnosis recently. This regularization method aligns the second-order statistics (batch covariances) of the source and target with a linear transformation. An extended version named Deep CORAL is integrated with a deep neural network and learns a nonlinear transformation to align the correlations of layer activations
[dc]. The RMCA-1DCNN model, based on a Riemannian-metric correlation alignment loss, is proposed in [cia] to achieve unsupervised fault diagnosis with domain-invariant and fault-discriminative capacity. Furthermore, UDA seeks to reduce the distance between distributions in latent space using criteria such as the maximum mean discrepancy (MMD) and multi-kernel maximum mean discrepancy (MK-MMD) [9146579, zhang2021joint]. For instance, MK-MMD is employed in [an2020deep] to decrease the distribution disparity caused by changing operating conditions. Lu et al. [lu2021new] introduced a multilayer MMD to match the source and target distributions. These approaches aim to decrease the difference between the mean values of the two domains' distributions, but data characteristics such as the median and standard deviation may still differ between domains after applying an MMD-based approach [mao2020new]. This reduces classification accuracy in the target domain. It should be noted that if the distribution discrepancy between the data is substantial, some critical data characteristics may be lost while projecting data from the two domains into the same feature space. In a nutshell, an MMD-based approach alone is not sufficient to learn invariant features across two domains. The discriminative adversarial network for domain adaptation has been introduced with promising outcomes [zhang2021conditional]. In this technique, the source and target domain distributions are aligned until the discriminator can no longer determine whether the input data belongs to the source or target domain. Mao et al. [mao2020new] proposed adversarial domain training to identify faults under various working conditions. Xu et al. [xu2021intelligent] employed multilayer adversarial learning to align the domains and boost the model's generalizability. Li et al. [li2020intelligent] proposed two feature extractors, one using MMD techniques and the other adversarial domain training, to extract invariant features; ensemble learning is used to improve the accuracy of fault diagnosis. Zhang et al. [zhang2019deep] introduced a multi-adversarial domain adaptation approach with the Wasserstein distance criterion to aid in diagnosing bearing faults under various operating conditions. The distribution discrepancy between the two domains in bearing fault diagnosis is considerable due to multiple factors such as changing operating conditions and environmental noise. Accordingly, as the distribution difference between the two domains increases, the convergence of adversarial techniques to a stable point becomes more difficult. Hence, an effective solution is critical to increasing the stability of adversarial techniques in bearing diagnosis.
The domain label, the label of each category, and the data structure are three critical kinds of information that play an essential role in UDA techniques for transferring knowledge from the source to the target domain [ma2019gcan]. The domain label is used to train a domain classifier that represents the global distribution of both domains in adversarial UDA. If the category label of a source sample and a target sample is the same, they are expected to be mapped to the same region of the feature space. The data structure comprises the data's intrinsic characteristics, such as its probability distribution and geometric structure. All three types of information can help alleviate the gap between the two domains under different operating conditions and improve the performance of the fault diagnosis model [li2021domain]. To the best of our knowledge, most studies cover only one or two of these types of information and pay little attention to the geometric structure of the source and target data. The data structure is treated as a grid in the aforementioned models, which may limit the models' generalization. On the other hand, if data with a graph-based structure is fed to a network such as a CNN, the desired results are not obtained, because such networks are built for grid-structured data. Furthermore, the objective of domain adaptation is to match the distributions of the source and target domains as a whole; although each domain contains multiple subdomains, the UDA technique does not align the distribution of each subdomain with its corresponding class. After adjusting the global distribution of the two domains, some unrelated data from different classes may lie close to the data of a specific class in latent space, decreasing domain adaptability and diagnostic accuracy. One solution to this problem is the local maximum mean discrepancy (LMMD) technique [zhu2020deep], an extension of the MMD method. In addition to adjusting the global distribution between domains, LMMD brings the distributions of each identical class, treated as a subdomain in the two separate domains, closer together in latent space and adapts them. This paper proposes a unique DSAGCN method, a graph convolutional neural network (GCNN)-based [gcnn] solution with distribution discrepancy reduction employing LMMD and an adversarial loss function, to address all of the limitations mentioned above. A CNN is used in DSAGCN to extract features from the vibration signal. The extracted features are fed into several topology adaptive graph convolutional network (TAGCN) [du2017topology] blocks that learn the geometric structure of the data by investigating the relationships between the structural characteristics of the data and propagating structural information through the graph network's parameters. Then, the LMMD loss function and the adversarial loss function jointly alleviate the structural discrepancy between the domain distributions. Both criteria have the advantages described earlier. As a result, the structural information of the data is examined by modeling features as graphs, the classifier models the class labels, and the adversarial domain discriminator distinguishes the domain label of each sample. Therefore, the suggested DSAGCN method incorporates all three elements that significantly increase UDA performance. The following are the major contributions of this study:

This paper offers a practical and comprehensive end-to-end DSAGCN method to diagnose cross-domain bearing faults based on a graph network, a domain adversarial discriminator, and structured subdomain adaptation. In the proposed method, all information, including the data structure, the domain label, and the label of each class, is used in an integrated manner to minimize the distribution difference across domains and between relevant subdomains in latent space.

The GCN model adopted in this study is the TAGCN method. The simulation results indicate that this model can provide acceptable results using graph filters with second-order polynomials. Combining it with a CNN increases the model's effectiveness in identifying bearing faults under different operating conditions.

Due to the substantial distribution difference between the source and target domains caused by changing the load while collecting the vibration data, LMMD and an adversarial domain discriminator are employed concurrently to find domain-invariant and discriminative features that reduce the distribution difference between the two domains and align them. The ablation study demonstrates that the LMMD loss applied in DSAGCN outperforms CORAL and other MMD-based loss functions in average accuracy and convergence speed.
The rest of this paper is structured as follows: The fundamental theory of the approaches utilized in DSAGCN is explained in Section II. Section III describes the structure and characteristics of the proposed DSAGCN method. In Section IV, after introducing the dataset used to evaluate the model, the experimental results of the proposed technique are compared with other comparative methods to assess the benefits of the DSAGCN method. Finally, Section V provides the conclusion.
II Preliminaries
II-A Graph Convolutional Neural Network
CNNs have essential properties such as local connectivity, weight sharing, and shift invariance, which allow them to be used in various disciplines such as image processing and fault diagnosis, on data with a uniform, grid-based structure [chen2021multiscale]. However, conventional CNN methods cannot achieve accurate results in many cases, such as biological systems, social networking applications, and other fields where the data has a non-Euclidean, irregular structure [cheung2020graph]. Graph-structured models outperform conventional CNNs in these applications. Furthermore, knowing the geometric structure of the data improves model learning and lowers the distribution gap across domains in transfer learning. Graph signal processing (GSP) [ortega2018graph] is used to adapt conventional CNNs by leveraging graph theory, which provides both a general framework and a rigorous view of GCNNs. A conventional CNN and a GCNN are compared in Fig. 1. The convolution operator computes the sum of point-wise multiplications of a subset of the input by the kernel of that layer, and this process is repeated as the kernel moves across the input. Fig. 1 also demonstrates the GCNN architecture and how it differs from a conventional CNN for inputs with irregular structure: a 1-degree filter is shown, with the blue vertex collecting information from the red adjacent vertices.
A graph is commonly expressed as $G=(V,E,A)$, where $V$ is the set of $N$ vertices, $E$ denotes the set of edges connecting pairs of vertices, and $A\in\mathbb{R}^{N\times N}$ expresses the adjacency matrix that shows how each vertex is related to the other vertices. Each term $A_{ij}$ represents the weight of the edge connecting vertices $i$ and $j$. The graph Laplacian is defined as $L=D-A$, where $L$ is the graph Laplacian and $D$ is the degree matrix of $A$ [yu2021fault]. Every graph is either directed or undirected. The adjacency matrix is symmetric in an undirected graph, since the path between two vertices is shared, but it can be asymmetric in a directed graph due to the direction of the edges. Any graph can also be either weighted or unweighted. GCNN techniques are broadly classified into two types: spectral domain and spatial domain [li2021fault].
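These definitions can be sketched in a few lines of numpy on a toy four-vertex undirected graph (the variable names are ours, not the paper's):

```python
import numpy as np

# Toy undirected graph with N = 4 vertices.
# A is the symmetric adjacency matrix, D the diagonal degree matrix,
# and L = D - A the graph Laplacian.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))  # degree matrix of A
L = D - A                   # graph Laplacian

# For an undirected graph, L is symmetric positive semidefinite
# and every row of L sums to zero.
```

For a directed graph, `A` would be asymmetric and the Laplacian would lose these properties, which is why spectral methods discussed below require undirected graphs.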
The spectral domain technique utilizes the Fourier transform of the graph and the generalized Laplacian operator. The spatial domain approach does not employ the Fourier transform, relying instead on GSP's definition of graph convolution and the concept of graph shift [sandryhaila2013discrete]. This concept describes how information is propagated from one node to its neighbors, replacing each node's signal value with a linear combination of the signal values in that node's neighborhood. The shift operator is a vital component of GSP: taking, for example, the adjacency matrix $A$ as the shift operator and $x$ as a graph signal, the one-stage propagation is the new graph signal $Ax$, and the $n$-stage propagation is the signal $A^{n}x$. The TAGCN technique [du2017topology] is one of the most effective spatial-based methods in GCNNs. A graph convolutional layer is generated in this technique by integrating the concepts of graph convolution and graph shift. The operation of graph convolution in the $\ell$th convolutional layer is described without loss of generality. Assume that each vertex of the graph has $C_{\ell}$ input features in this layer, and that the vector $x_{c}^{(\ell)}\in\mathbb{R}^{N}$ contains the $\ell$th layer's input data for the $c$th feature over all vertices, indexed by the graph. Assume further that $G_{c,f}^{(\ell)}$ represents the GCNN's $f$th filter in the $\ell$th layer. Multiplying by this matrix produces the graph convolution; the $f$th output of the convolution is calculated using the following equation:

$y_{f}^{(\ell)}=\sum_{c=1}^{C_{\ell}}G_{c,f}^{(\ell)}\,x_{c}^{(\ell)}+b_{f}^{(\ell)}\mathbf{1}_{N}$  (1)
where $b_{f}^{(\ell)}$ is the bias and $\mathbf{1}_{N}$ is a vector of all ones with dimension equal to the number of graph vertices. $G_{c,f}^{(\ell)}$ is a square matrix that can be obtained as a polynomial in the normalized adjacency matrix $\hat{A}$:

$G_{c,f}^{(\ell)}=\sum_{k=0}^{K}g_{c,f,k}^{(\ell)}\,\hat{A}^{k}$  (2)

In this equation, $g_{c,f,k}^{(\ell)}$ and $\hat{A}$ are the polynomial coefficients of the graph filter and the normalized adjacency matrix, respectively. Normalizing the adjacency matrix ensures that all of its eigenvalues are located inside the unit circle, resulting in computational stability.
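A TAGCN-style polynomial filter of this form can be sketched in numpy; this is a hypothetical minimal implementation under our own naming, not the authors' code:

```python
import numpy as np

def tagcn_filter(A, x, g, b=0.0):
    """Polynomial graph filter y = (sum_k g_k * A_hat^k) x + b.

    A : (N, N) adjacency matrix,  x : (N,) graph signal,
    g : polynomial coefficients g_0, ..., g_K (K = 2 in the paper's setting),
    b : scalar bias added to every vertex.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    A_hat = d_inv_sqrt @ A @ d_inv_sqrt  # normalized adjacency matrix
    y = np.zeros_like(x, dtype=float)
    A_pow = np.eye(A.shape[0])           # A_hat^0 = I
    for g_k in g:                        # accumulate g_k * A_hat^k @ x
        y = y + g_k * (A_pow @ x)
        A_pow = A_pow @ A_hat
    # the eigenvalues of A_hat lie within the unit circle,
    # so the matrix powers stay numerically stable
    return y + b
```

With `g = [1, 0, 0]` the filter reduces to the identity; a second-order filter (K = 2) uses three learnable coefficients per input/output feature pair.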
The TAGCN method, as a spatial-based method, can be compared with spectral methods in several respects. The TAGCN technique uses a series of learnable node filters with fixed sizes ranging from 1 to $K$ to perform graph convolution. Furthermore, it has $O(K)$ learning complexity, similar to conventional CNN algorithms. TAGCN has a reduced computational burden compared with spectrum-based techniques, which carry a more significant cost owing to the Fourier transform, the inverse Fourier transform, and eigendecomposition. Aside from the reduced computational burden, the TAGCN technique also has lower computational complexity than spectral methods: second-order polynomials can produce good results in this approach, whereas polynomials of the graph adjacency matrix of order up to 25 are required in [defferrard2016convolutional]. Fig. 2 depicts the TAGCN polynomial filter for $K=2$. The selected $K$ and the graph direction indicate the connections for propagating and collecting information from the orange nodes to the jade node in Fig. 2. This figure demonstrates the TAGCN approach's decreased computational complexity compared to other methods, particularly the method provided in [defferrard2016convolutional]. In addition, unlike the study in [kipf2016semi], no linear approximation is used in TAGCN, which would lead to data loss and lower classification accuracy; TAGCN instead uses GSP theory to construct filters of sizes 1 to $K$ to avoid this problem. The spectral technique is unusable for directed graphs, because it utilizes shift operators such as the graph Laplacian, which is relevant only to undirected graphs, and because the graph Laplacian must be positive semidefinite to apply this approach, which is possible only when the matrix $A$ is symmetric. Unlike spectrum-based strategies, each convolutional layer in the TAGCN technique is unique due to variable-size graph convolutional layers. Moreover, the Fourier transform in spectral methods is not unique if an eigenvalue of the Laplacian is repeated.

II-B Deep Structured Subdomain Adaptation
DL methods mostly assume that the distributions of the test and training data are homogeneous; if this assumption is false, DL performance is significantly reduced, and the distribution difference between test and training data does arise in many applications. The domain adaptation (DA) technique is presented as a solution for this category of problems, in which learning an invariant feature space across the source and target domains reduces the difference in domain distributions and increases the model's performance. A domain is defined in domain adaptation as $\mathcal{D}=\{\mathcal{X},P(X)\}$, where $\mathcal{X}$ and $P(X)$ are the data feature space and its marginal probability distribution, respectively, and $X\in\mathcal{X}$ are instance samples. $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ denote the source and target domains, respectively. An unsupervised domain adaptation problem is described as having labeled source domain data $\mathcal{D}_{s}=\{(x_{i}^{s},y_{i}^{s})\}_{i=1}^{n_{s}}$ and unlabeled target domain data $\mathcal{D}_{t}=\{x_{j}^{t}\}_{j=1}^{n_{t}}$. The label space is denoted by $\mathcal{Y}$. In this scenario, it is assumed that the feature space, category space, and conditional probability distribution of the two domains are the same, but their marginal probability distributions differ owing to the domain shift. Domain adaptation aims to predict the labels of the target domain data using information transferred from the source domain [zheng2019cross]. The core idea of the global domain adaptation technique is to roughly align the global probability distributions of the source and target domains. Despite the benefits of this technique, some irrelevant data from one class may lie close to another class in feature space after aligning the distributions, and DA performance suffers from the lack of relationships between subdomains of the same class in different domains [zhu2020deep]. The subdomain adaptation technique is an efficient solution to this problem. The probability distributions of the same classes in different domains, and the discrepancy between them, are considered in this method. In addition to aligning the distributions of the two domains, this strategy moves the distributions of subdomains with the same class closer together in feature space. Fig. 3 depicts a comparison of the DA technique and the subdomain adaptation method. The relationship between the data in the two domains must be utilized to split the source and target domains into multiple subdomains containing data of the same class. Nevertheless, since there are no labels in the target domain in unsupervised learning problems, the network output is employed to provide pseudo-labels for the target domain data.
As an outcome of the use of pseudo-labels, the source and target domains are divided into multiple subdomains $\mathcal{D}_{s}^{(m)}$ and $\mathcal{D}_{t}^{(m)}$ with probability distributions $p^{(m)}$ and $q^{(m)}$, respectively, where $M$ is the number of classes. The suggested approach for subdomain adaptation is the LMMD technique [zhu2020deep], an extension of the MMD method that determines the distribution discrepancy of each matching subdomain across the two domains. MMD is the most commonly exploited non-parametric distance metric for DA; it measures the difference between the two domain distributions in a reproducing kernel Hilbert space (RKHS). The MMD between $p$ and $q$ is described by the following relationship:

$d_{\mathcal{H}}(p,q)=\left\|\mathbb{E}_{p}[\phi(x^{s})]-\mathbb{E}_{q}[\phi(x^{t})]\right\|_{\mathcal{H}}^{2}$  (3)

In this context, $\mathcal{H}$ is an RKHS, and $\phi(\cdot)$ is a nonlinear mapping that converts data from $\mathcal{D}_{s}$ and $\mathcal{D}_{t}$ into the RKHS feature space. To simplify computation, the kernel trick $k(x^{s},x^{t})=\langle\phi(x^{s}),\phi(x^{t})\rangle$ is utilized, where $\langle\cdot,\cdot\rangle$ is the inner product of the vectors. As a result, an unbiased estimate of Eq. 3 equals:

$\hat{d}_{\mathcal{H}}(p,q)=\left\|\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\phi(x_{i}^{s})-\frac{1}{n_{t}}\sum_{j=1}^{n_{t}}\phi(x_{j}^{t})\right\|_{\mathcal{H}}^{2}$  (4)

In this case, $n_{s}$ and $n_{t}$ represent the number of source and target samples, respectively. As previously stated, despite the MMD technique's effectiveness in measuring the distribution difference between two domains and its essential role as a regularization term in the loss function of some applications, a method is still needed to compute the distribution difference between each pair of matching subdomains. The LMMD approach is employed in this study as a generalized criterion of MMD to determine this discrepancy, which is defined as follows:
$d_{\mathcal{H}}(p,q)=\mathbb{E}_{m}\left\|\mathbb{E}_{p^{(m)}}[\phi(x^{s})]-\mathbb{E}_{q^{(m)}}[\phi(x^{t})]\right\|_{\mathcal{H}}^{2}$  (5)
Each sample in each class is allocated a weight $w^{m}$. The relationship mentioned above then becomes:

$\hat{d}_{\mathcal{H}}(p,q)=\frac{1}{M}\sum_{m=1}^{M}\left\|\sum_{i=1}^{n_{s}}w_{i}^{sm}\,\phi(x_{i}^{s})-\sum_{j=1}^{n_{t}}w_{j}^{tm}\,\phi(x_{j}^{t})\right\|_{\mathcal{H}}^{2}$  (6)

where $w_{i}^{sm}$ and $w_{j}^{tm}$ represent the weights of $x_{i}^{s}$ and $x_{j}^{t}$ pertaining to class $m$, respectively. The weight of a sample $x_{i}$ for class $m$ is calculated as:

$w_{i}^{m}=\frac{y_{im}}{\sum_{(x_{j},y_{j})}y_{jm}}$  (7)

where $y_{im}$ is the $m$th entry of the label vector $y_{i}$. Computing Eq. 6 requires the $w_{i}^{sm}$ and $w_{j}^{tm}$ values; as previously stated, a label is provided for each source sample, but $w_{j}^{tm}$ must be determined using a pseudo-label obtained from the classifier's output for each target sample. Eq. 6 can be modified to align the source and target domain features in the $l$th of $L$ layers, where $z^{l}$ is the $l$th layer activation:
$\hat{d}_{l}(p,q)=\frac{1}{M}\sum_{m=1}^{M}\Big[\sum_{i=1}^{n_{s}}\sum_{j=1}^{n_{s}}w_{i}^{sm}w_{j}^{sm}\,k(z_{i}^{sl},z_{j}^{sl})+\sum_{i=1}^{n_{t}}\sum_{j=1}^{n_{t}}w_{i}^{tm}w_{j}^{tm}\,k(z_{i}^{tl},z_{j}^{tl})-2\sum_{i=1}^{n_{s}}\sum_{j=1}^{n_{t}}w_{i}^{sm}w_{j}^{tm}\,k(z_{i}^{sl},z_{j}^{tl})\Big]$  (8)
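The weighted estimator above can be sketched with numpy using its kernel-expanded form; this is a minimal illustration with our own function names, in which soft pseudo-labels stand in for the unknown target labels:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def lmmd(Xs, Ys, Xt, Yt_prob, gamma=1.0):
    """Local MMD between source and target features.

    Xs, Xt  : (ns, d) and (nt, d) feature matrices,
    Ys      : (ns, M) one-hot source labels,
    Yt_prob : (nt, M) softmax outputs used as soft pseudo-labels.
    """
    M = Ys.shape[1]
    # per-class weights w_i^m = y_im / sum_j y_jm, guarded against empty classes
    Ws = Ys / np.maximum(Ys.sum(axis=0, keepdims=True), 1e-12)
    Wt = Yt_prob / np.maximum(Yt_prob.sum(axis=0, keepdims=True), 1e-12)
    Kss = rbf_kernel(Xs, Xs, gamma)
    Ktt = rbf_kernel(Xt, Xt, gamma)
    Kst = rbf_kernel(Xs, Xt, gamma)
    loss = 0.0
    for m in range(M):  # one squared-RKHS-distance term per subdomain
        ws, wt = Ws[:, m], Wt[:, m]
        loss += ws @ Kss @ ws + wt @ Ktt @ wt - 2.0 * ws @ Kst @ wt
    return loss / M
```

When the target features and pseudo-labels coincide with the source ones, every per-class term vanishes and the loss is zero; it grows as matching subdomains drift apart.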
II-C Domain Adversarial Network
The adversarial domain adaptation network is one of the most famous fault diagnosis architectures. The application of this technique in unsupervised fault diagnosis has increased due to its powerful ability to reduce the distribution discrepancy between data in different domains [huang2020deep]. The architecture contains two cooperating networks: a feature extractor and a domain discriminator. On the one hand, the feature extractor network aims to extract informative and discriminative features from the source and target domains. On the other hand, the domain discriminator network distinguishes whether extracted features come from the target or the source domain; it follows a binary classification scheme, learning a logistic regressor that maps features into [0,1]. By fooling the discriminator through a reverse gradient layer, the feature extractor learns to map target features close to the source feature space. Consequently, a classifier can use this knowledge to identify the unlabeled fault types in the target domain [yu2020conditional]. Finally, the optimization objective, a binary cross-entropy loss for the adversarial domain network, is as follows:

$\mathcal{L}_{adv}=-\mathbb{E}_{x^{s}\sim\mathcal{D}_{s}}\left[\log D(G(x^{s}))\right]-\mathbb{E}_{x^{t}\sim\mathcal{D}_{t}}\left[\log\left(1-D(G(x^{t}))\right)\right]$  (9)

where $G(\cdot)$ is the feature extractor and $D(\cdot)$ is the domain discriminator, whose output is compared with the domain label (0 or 1) of each input.
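This objective can be evaluated directly from the discriminator's sigmoid outputs; a minimal numpy sketch follows, using the convention source = 1, target = 0 (the gradient reaching the feature extractor is negated by the reversal layer during backpropagation):

```python
import numpy as np

def domain_adversarial_loss(d_src, d_tgt, eps=1e-12):
    """Binary cross-entropy domain loss.

    d_src, d_tgt : discriminator sigmoid outputs in (0, 1) for source and
    target batches. The discriminator is pushed toward 1 on source features
    and 0 on target features, while the gradient reversal layer flips this
    objective for the feature extractor.
    """
    d_src = np.clip(d_src, eps, 1.0 - eps)  # numerical safety
    d_tgt = np.clip(d_tgt, eps, 1.0 - eps)
    return -(np.log(d_src).mean() + np.log(1.0 - d_tgt).mean())
```

A perfectly confused discriminator outputs 0.5 everywhere, giving the equilibrium loss 2 ln 2 ≈ 1.386.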
III Proposed Method
In this section, we discuss the detailed architecture of the proposed method, shown in Fig. 4. As summarized in Table I, we divide our model into graph feature extraction, domain adaptation, and classifier networks, which are detailed below:
III-A Graph Feature Extraction
We employ a five-layer CNN in the first part of the feature extraction network, with a wide kernel in the first layer to capture longer dependencies. As we proceed deeper into the layers, the kernel size decreases, improving local feature extraction and feature representation. Furthermore, using a wide rather than a small kernel in the first layers suppresses high-frequency environmental noise in the input data, making the network more robust in the classification task [zhang2018deep]. Directly after each convolution operation, batch normalization is used to accelerate the network's training and decrease internal covariate shift [bt]. The rectified linear unit (ReLU) is utilized as the activation function to improve representation ability and learn complex patterns in the data [zhang2018deep]. We use two kinds of pooling layers to reduce the network's parameters: max-pooling layers directly after each ReLU activation, and adaptive max-pooling at the top of the CNN to produce a reduced, fixed-size output dimension. In the next stage, we reduce the dimension of the feature vector with a dense layer (FC1) of 256 neurons to obtain a more robust feature representation. As the input of TAGCN must be structured, it is necessary to transform the unstructured features into structured ones: the output of the CNN network passes to a graph generation layer (GGL) [li2021domain] to produce structured graph data, and TAGCN is then applied to the structured features. Each feature vector is specified as a node, and an adjacency matrix is defined by multiplying the feature matrix by its transpose, as shown below:

| Network | Layer | Kernel Size/Stride/Filter Number | Output Size | Pooling Size | Padding |
| --- | --- | --- | --- | --- | --- |
| Graph Feature Extractor | Conv1D, BN, ReLU, Max Pool | 128/1*1/16 | N*1024*16 | 2*2 | Yes |
| | Conv1D, BN, ReLU, Max Pool | 64/1*1/32 | N*512*32 | 2*2 | Yes |
| | Conv1D, BN, ReLU, Max Pool | 32/1*1/64 | N*256*64 | 2*2 | Yes |
| | Conv1D, BN, ReLU, Max Pool | 16/1*1/128 | N*128*128 | 2*2 | Yes |
| | Conv1D, BN, ReLU, Adaptive Max Pool | 3/1*1/128 | N*4*128 | 32*32 | Yes |
| | FC1, ReLU, Dropout 0.5 | 256 neurons | N*256*1 | | No |
| | GGL, Dropout | None | N*N | | No |
| | TAGCN, BN | 128 neurons | N*128 | | No |
| | TAGCN, BN | 256 neurons | N*256 | | No |
| Domain Discriminator | FC2, ReLU, Dropout 0.5 | 128 neurons | N*128 | | No |
| | FC3, ReLU, Dropout 0.5 | 128 neurons | N*128 | | No |
| | FC4, Sigmoid | 1 neuron | N*1 | | No |
| Classifier | FC5, Softmax | Number of fault types | N*10 | | No |
(10) 
(11) 
(12) 
Here, a normalization function is applied to the adjacency matrix. It is beneficial to make the adjacency matrix sparse to avoid computational cost; therefore, as detailed below, $K(\cdot)$ returns the $k$ largest elements of a given row of the adjacency matrix. Finally, Eq. 13 delivers a sparse adjacency matrix by keeping the top-$k$ largest values of each row of $A$:
$A_{ij}^{sp}=\begin{cases}A_{ij}, & A_{ij}\in K(A_{i,:})\\ 0, & \text{otherwise}\end{cases}$  (13)
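A sketch of the graph generation layer plus the top-k sparsification of Eq. 13 follows. The exact normalization of Eqs. 10 to 12 is not reproduced here; a row-wise softmax is used as one plausible stand-in, and all names are ours:

```python
import numpy as np

def graph_generation(F, k=3):
    """Build a sparse adjacency matrix from a batch of node features F (N, d).

    Each feature vector becomes a node; the dense adjacency is F @ F.T,
    normalized row-wise (softmax here, an assumed choice of normalization),
    and only the k largest entries of each row are kept (Eq. 13).
    """
    A = F @ F.T
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)      # row-wise normalization
    drop = np.argsort(A, axis=1)[:, :-k]      # indices of the N-k smallest entries
    A_sparse = A.copy()
    np.put_along_axis(A_sparse, drop, 0.0, axis=1)
    return A_sparse
```

The sparsified adjacency is what the TAGCN layers below consume, keeping each node connected only to its k strongest neighbors.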
The generated graph is fed through graph convolution layers in the next stage, which extract node features and aggregate neighbors' information. For better feature representation and robust, aligned structured features, we use two TAGCN layers with two hops ($K=2$). The output of each layer is connected directly to a graph batch normalization layer, as described below:

$\hat{h}=\frac{h-\mathrm{E}[h]}{\sqrt{\mathrm{Var}[h]+\epsilon}}$  (14)

$h'=\gamma\hat{h}+\beta$  (15)

where $\hat{h}$ denotes the normalized structured features; $\mathrm{E}[\cdot]$ and $\mathrm{Var}[\cdot]$ are the expectation and variance functions, respectively; $\gamma$ and $\beta$ are trainable parameters; and $\epsilon$ is added for numerical stability. To sum up, considering Eq. 10 to Eq. 15, we treat the CNN and GCNN networks together as a graph feature extractor network, denoted by $G(\cdot)$, which takes input data and returns structured features.

III-B Domain Adaptation Networks
As stated, we use two separate networks for aligning the target and source features and reducing the distribution discrepancy between features in latent space, which are discussed in the two following subsections:
III-B1 Domain Discriminator
Adversarial domain learning tries to identify the domain label of the extracted features supplied to the domain discriminator, as described in Section II. To produce domain-invariant features, the gradient of the loss in Eq. 9 is inverted by a gradient reversal layer (GRL) before reaching the feature extractor. We use FC2 and FC3 as the backbone of the domain discriminator network to produce more robust invariant latent-space features. As a result, the classifier can handle health states from either the target domain or the source domain.
III-B2 LMMD Loss
The LMMD loss is applied at the top of the graph feature extractor to reduce the distribution discrepancy between the extracted structured features. As a non-parametric technique, LMMD aims to match the distributions of source and target features and to reduce the discrepancy between the distributions of relevant subdomains by integrating deep feature adaptation and feature learning. As detailed in Eq. 8, the radial basis function (RBF) kernel is chosen as the kernel.
III-C Classification Layer
We use a fully connected layer whose number of neurons equals the number of health conditions, with the softmax function adopted as the activation. The classifier also produces the pseudo-labels in the target domain used to calculate the LMMD loss. The objective function of classification is defined as:
(16) 
(17) 
(18) 
where $E(\cdot)$ denotes the mathematical expectation, and the other two quantities are the logits generated by the classification layer.
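The pseudo-labeling step can be illustrated as follows (the logits are hypothetical values of our own; the softmax probabilities are the soft pseudo-labels consumed by the LMMD loss, while argmax gives the hard prediction):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over class logits."""
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical target-domain logits from the classification layer.
logits_t = np.array([[2.0, 0.1, -1.0],
                     [0.2, 1.5,  0.3]])
pseudo_prob = softmax(logits_t)           # soft pseudo-labels for the LMMD loss
pseudo_hard = pseudo_prob.argmax(axis=1)  # hard predicted health states
```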
Algorithm: DSAGCN

Require: raw input data, learning rate, trade-off parameters
1. Define the labeled source and unlabeled target datasets by preprocessing the raw data.
2. Initialize the model's weights using Xavier initialization.
3. Feed the input to the graph feature extractor.
For each training iteration do:
4. Compute the classification loss on the labeled source data.
5. Compute the LMMD loss between the structured source and target features.
6. Compute the adversarial domain loss.
7. Combine the three losses and update the network parameters by backpropagation.
Until the three losses converge.
III-D Objective Function
Considering the classification loss (Eq. 18), the structured subdomain loss (Eq. 6), and the adversarial loss (Eq. 9), the total loss of the DSAGCN method, combining the three defined losses, is described as follows:

$\mathcal{L}_{total}=\mathcal{L}_{cls}+\lambda\,\mathcal{L}_{lmmd}+\mu\,\mathcal{L}_{adv}$  (19)

where $\lambda$ and $\mu$ are trade-off hyperparameters. The algorithm of the proposed DSAGCN is summarized in Table II. After the total loss is produced, the parameters of each network are updated by backpropagation until convergence, as detailed below:
$\theta_{g}\leftarrow\theta_{g}-\eta\,\frac{\partial\mathcal{L}_{total}}{\partial\theta_{g}},\qquad \theta_{d}\leftarrow\theta_{d}-\eta\,\frac{\partial\mathcal{L}_{total}}{\partial\theta_{d}},\qquad \theta_{c}\leftarrow\theta_{c}-\eta\,\frac{\partial\mathcal{L}_{total}}{\partial\theta_{c}}$  (20)

where $\theta_{g}$, $\theta_{d}$, and $\theta_{c}$ are the parameters of the graph feature extractor, the domain discriminator, and the classification layer, respectively; $\partial$ denotes the partial derivative; and $\eta$ is the learning rate.
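The objective amounts to a weighted loss sum followed by gradient steps on each parameter block; a minimal sketch follows, in which plain gradient descent stands in for the Adam optimizer actually used and the weight names are ours:

```python
import numpy as np

def total_loss(l_cls, l_lmmd, l_adv, lam=0.5, mu=1.0):
    """Weighted combination of the three DSAGCN losses; the default
    trade-off values 0.5 and 1 follow the paper's implementation details."""
    return l_cls + lam * l_lmmd + mu * l_adv

def update(theta, grad, eta=1e-3):
    """One gradient step on a parameter block (a plain-SGD stand-in)."""
    return theta - eta * grad
```

In training, `update` would be applied separately to the parameters of the graph feature extractor, the domain discriminator, and the classifier, each with the gradient of the total loss with respect to that block.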
IV Experiments
The effectiveness of the proposed model is investigated in a variety of ways on two well-known datasets, the CWRU [loparo2003bearing] and Paderborn [lessmeier2016condition] bearing datasets, in order to assess the validity of the proposed DSAGCN method in diagnosing bearing faults under various operating conditions.
IV-A Implementation Details
The Xavier initializer [datta2020survey] is used during the training phase to set the DSAGCN method's parameters. The initial learning rate of DSAGCN is set to 0.001. The Adam optimization technique [kingma2014adam] is employed to optimize the parameters, and each batch has a length of 128. The optimal values of the two trade-off parameters are selected as 0.5 and 1, respectively. In addition, the degree of the polynomial graph filter used in the DSAGCN method is set to 2. Each experiment is repeated ten times to decrease the randomness of the results. The average fault diagnosis accuracy is used as the assessment criterion.
IV-B Compared Approaches
To demonstrate the superiority of the proposed DSAGCN method over existing techniques, ten comparative methods are implemented: SVM, CNN, JDA [long2013transfer], CORAL [sun2016return], DANN [ajakan2014domain], unsupervised domain adaptation convolution neural network (UDA-CNN), Baseline, graph convolution maximum mean discrepancy (GC-MMD), graph convolution multi-kernel maximum mean discrepancy (GC-MKMMD), and graph convolution CORAL (GC-CORAL). All of these techniques use the same hyperparameters and settings as the DSAGCN method. They can be divided into the following categories:

Traditional approaches: An SVM with a radial basis function (RBF) kernel is used to evaluate the proposed method against a traditional supervised method. Six features are chosen as the SVM model's input, similar to [xu2020transfer]. The SVM model is trained using features extracted from the labeled source domain and tested on unlabeled data from the target domain.

DL-based approaches: The CNN backbone of the proposed graph feature extractor plus the classification layer is used as a CNN model for comparison. The labeled source domain is utilized for CNN training, whereas the unlabeled target domain data is used exclusively for testing. This model involves no domain adaptation, and its only loss function is cross-entropy.

DA-based techniques: Four domain adaptation approaches, including JDA, CORAL, DANN, and UDA-CNN, are utilized to demonstrate the superiority of the proposed method. The UDA-CNN approach has the same structure as the CNN model but incorporates two domain adaptation modules, a domain discriminator and LMMD, similar to the DSAGCN method. The only difference between the UDA-CNN and DSAGCN models is the absence of a graph; all other variables and hyperparameters are the same.

Graph-based techniques: To compare structured subdomain adaptation against other domain adaptation approaches with non-parametric losses, we consider the backbone of the DSAGCN method with the domain discriminator network in GC-MKMMD, GC-MMD, and GC-CORAL, and without any adaptation module in the Baseline method. As previously stated, the Baseline model includes the feature extraction and classification layers and, unlike the CNN model, a graph; unlike DSAGCN, however, it employs no domain adaptation modules. GC-MKMMD, GC-MMD, and GC-CORAL all employ the adversarial domain discriminator loss and the cross-entropy loss and differ only in the third loss function: in contrast to DSAGCN, GC-MMD applies MMD instead of LMMD, GC-MKMMD applies multi-Gaussian kernels with a mixture of five different bandwidths, and GC-CORAL applies the CORAL loss. These three techniques aim to decrease the distribution difference between domains by applying the three loss functions simultaneously, thereby improving classification accuracy.
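To make the distinction between the global and subdomain criteria concrete, the following NumPy sketch contrasts a biased Gaussian-kernel MMD estimate with a simplified LMMD that averages per-class MMD terms. In the unsupervised setting the target labels would be pseudo-labels predicted by the classifier; the toy data below are hypothetical:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd(Xs, Xt, sigma=1.0):
    """Biased estimate of the squared MMD between source and target samples."""
    return (gaussian_kernel(Xs, Xs, sigma).mean()
            + gaussian_kernel(Xt, Xt, sigma).mean()
            - 2.0 * gaussian_kernel(Xs, Xt, sigma).mean())

def lmmd(Xs, ys, Xt, yt, n_classes, sigma=1.0):
    """Simplified local MMD: average of per-class MMD terms, so each
    subdomain (class) is aligned separately."""
    terms = []
    for c in range(n_classes):
        Sc, Tc = Xs[ys == c], Xt[yt == c]
        if len(Sc) and len(Tc):
            terms.append(mmd(Sc, Tc, sigma))
    return float(np.mean(terms))

rng = np.random.default_rng(0)
# Two classes whose global mixtures coincide but whose per-class clusters are swapped.
Xs = np.concatenate([rng.normal(-2, 0.1, (50, 2)), rng.normal(+2, 0.1, (50, 2))])
Xt = np.concatenate([rng.normal(+2, 0.1, (50, 2)), rng.normal(-2, 0.1, (50, 2))])
ys = np.array([0] * 50 + [1] * 50)
yt = ys.copy()
# Global MMD is blind to the class swap; LMMD exposes the subdomain mismatch.
print(lmmd(Xs, ys, Xt, yt, 2) > mmd(Xs, Xt))  # → True
```

This is exactly the failure mode the structured subdomain loss targets: globally matched domains can still have misaligned classes.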
Working Condition  Load  Types of Fault  Fault Diameter (mils) 

A  0 hp  N,IRF,ORF,RF  7,14,21 
B  1 hp  N,IRF,ORF,RF  7,14,21 
C  2 hp  N,IRF,ORF,RF  7,14,21 
D  3 hp  N,IRF,ORF,RF  7,14,21 
IV-C Case I: CWRU Experiment
IV-C1 Data Description
The CWRU bearing vibration dataset is utilized to assess the efficiency of the proposed DSAGCN method; the test platform is depicted in Fig. 5. Vibration data are obtained using an accelerometer with a sampling rate of 12 kHz located at the motor's drive end. Data are gathered under four different operating conditions corresponding to load changes from 0 to 3 hp. The dataset considers four bearing health modes: Normal, Inner Race Fault (IRF), Outer Race Fault (ORF), and Roller Fault (RF), each fault mode with three severities of diameter 7, 14, and 21 mils. The dataset therefore covers ten different bearing health states, as given in Table III.
Data augmentation is used to increase the number of samples, which improves training. A sliding window with a length of 1024 data points and a time step of 475 is employed. After shuffling, there are 200 training and 50 test samples for each motor load. Based on the four loads, 12 transfer learning tasks are defined, each divided into a source and a target part. For instance, in task B→D, the labeled data from dataset B is used as the source domain, whereas unlabeled data from dataset D is used as the target domain. The DSAGCN method is trained on the CWRU dataset for 100 epochs.
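The sliding-window augmentation described above can be sketched as follows (the record length of 120,000 points is a hypothetical example, not a property of the CWRU files):

```python
import numpy as np

def sliding_windows(signal, length=1024, step=475):
    """Segment a 1-D vibration signal into overlapping windows
    (length 1024, stride 475, as in the CWRU preprocessing above)."""
    n = (len(signal) - length) // step + 1
    return np.stack([signal[i * step : i * step + length] for i in range(n)])

# A raw record of ~120k points yields a few hundred augmented samples.
raw = np.random.randn(120_000)
samples = sliding_windows(raw)
print(samples.shape)  # → (251, 1024)
```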
IV-C2 CWRU Bearing Fault Diagnosis: Results and Discussion
Fig. 6 depicts the simulation results for various tasks using the DSAGCN method and comparison approaches. The collected results can be categorized as follows:

It is evident from the results that the DSAGCN method outperforms the other comparison methods in unlabeled fault diagnosis. The DSAGCN method achieves higher accuracy by utilizing the geometric structure of data in graphs, matching each structured subdomain using the LMMD algorithm, and minimizing the distribution discrepancy between domains using the adversarial loss function. Because it covers all three types of essential information for the UDA technique, the DSAGCN method surpasses the traditional, DL-based, and DA-based categories. Furthermore, the DSAGCN method outperforms graph-based techniques in terms of accuracy, emphasizing the significance of deep structured subdomain adaptation.

In one of the most challenging tasks, B→D, the worst fault diagnosis accuracy is 57.95%, while the best accuracy, 99.03%, belongs to the proposed DSAGCN method. The DSAGCN method's accuracy is 0.02%, 1.65%, and 16.16% greater than the best accuracy of the graph-based, DA-based, and DL-based categories, respectively. This demonstrates the DSAGCN method's efficiency when there is a substantial distribution difference between domains.

Among graph-based techniques, GC-CORAL, GC-MMD, and GC-MKMMD achieve higher average accuracy than DA-based methods, demonstrating the value of hybrid DA tools and of data geometry in closing the distance between the two domains. The geometric structure of the data is thus an essential aspect of improving UDA performance. This group of approaches also serves as an ablation study: in contrast to the DSAGCN approach, GC-MMD, GC-MKMMD, and GC-CORAL employ MMD, MK-MMD, and CORAL instead of LMMD. The findings demonstrate the superiority of the proposed strategy and the effectiveness of LMMD.
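For completeness, the CORAL loss used by GC-CORAL can be sketched as below: a NumPy implementation of the covariance-alignment loss from [sun2016return], with the standard 1/(4d²) scaling. The synthetic features are hypothetical:

```python
import numpy as np

def coral_loss(Xs, Xt):
    """CORAL loss: squared Frobenius distance between source and target
    feature covariances, scaled by 1/(4 d^2) where d is the feature dimension."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False)
    Ct = np.cov(Xt, rowvar=False)
    return ((Cs - Ct) ** 2).sum() / (4.0 * d * d)

rng = np.random.default_rng(1)
Xs = rng.normal(0, 1.0, (200, 4))
Xt = rng.normal(0, 2.0, (200, 4))   # same mean, different covariance scale
print(coral_loss(Xs, Xs) == 0.0, coral_loss(Xs, Xt) > 0.0)  # → True True
```

Unlike MMD-family losses, CORAL matches only second-order statistics, which is one reason the subdomain-aware LMMD can align domains more finely.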

Due to its lack of domain matching ability, the Baseline technique has the lowest average accuracy among the graph-based methods and the proposed DSAGCN method. The Baseline technique nevertheless outperforms CNN in average accuracy, indicating the efficacy of the graph and the geometric structure of the data in lowering the distribution disparity. The Baseline approach is less accurate than UDA-CNN, meaning that the geometric data structure alone is not powerful enough to alleviate the distribution difference without other DA techniques. Performance improves when LMMD-based and adversarial domain adaptation techniques are combined.

The DSAGCN method and the graph-based and DA-based techniques outperform the other two groups. The SVM approach has the worst average accuracy because it does not exploit all of the key information in the vibration data: it relies on handcrafted extracted features as input, traditional machine learning has a low capacity for identifying complicated nonlinear relationships in data, and the model is trained only on source data. Although the CNN model outperforms the SVM model, owing to deep networks' capacity to analyze nonlinear connections and learn features automatically, it has lower classification accuracy than the graph-based and UDA-CNN methods because of the distribution discrepancy between training and test data, the lack of adequate tools to decrease that discrepancy, and its inattention to the differing geometric structures of training and test data. The CNN results show that a model trained on one load does not diagnose bearing faults accurately at other loads. Features acquired in the early layers of the CNN are generic and transfer to data with various distributions; however, features in the final layers become more task-specific, necessitating a tool such as DA to transfer information from the source domain to the target domain, decrease distribution discrepancies, and consequently enhance fault detection performance.
Dataset  Faulty Condition  Rotational Speed (rpm)  Load Torque (Nm)  Radial Force (N)

E  N, IRF, ORF  900  0.7  1000
F  N, IRF, ORF  1500  0.1  1000
G  N, IRF, ORF  1500  0.7  400
H  N, IRF, ORF  1500  0.7  1000
The t-distributed stochastic neighbor embedding (t-SNE) approach [van2008visualizing] is applied to give an intuitive view of the proposed DSAGCN method's efficiency in reducing the difference between the distributions of learned features across the two domains and in aligning relevant subdomains with the same class. This method maps high-dimensional learned features to a 2-D feature space. Fig. 7 depicts the t-SNE representation of the most challenging diagnostic task at five stages of the DSAGCN method. Fig. 7a exhibits the model's input data, revealing that the data from various classes are mixed and not suitable for classification. The output of the adaptive max-pooling layer, which represents the CNN part of the model, is shown in Fig. 7b; the distribution disparity between the source and target domains is visible in this figure. In Fig. 7c, with the inclusion of the FC1 layer after the CNN section, the separation is better than that of the CNN, but data of distinct categories are still misplaced; for example, IRF data with a severity of 21 mils are not adequately distinguished from ORF data with a severity of 21 mils, which has a detrimental impact on fault diagnosis performance. In Fig. 7d, which visualizes the output of the TAGCN layer, the classes are well separated from each other, although the disparity between the domain distributions remains substantial. Domain adaptation is employed to tackle this issue, as seen in Fig. 7e, which indicates the beneficial effect of subdomain adaptation on reducing distribution disparities and improving class separation.
IV-D Case II: PU Experiment
IV-D1 Data Description
The experimental data are gathered from the Paderborn University test rig [lessmeier2016condition], which consists of five components: an electric motor, a measuring shaft, a rolling bearing module, a flywheel, and a load motor. Fig. 8 depicts this modular test rig. Three bearing health conditions are considered: Normal, IRF, and ORF, with both artificial and real damage used to generate the fault types. Two distinct values are examined for each of the rotational speed, load torque, and radial force, and four working conditions (Table IV) are investigated to assess fault diagnosis performance under various operating conditions. Nine distinct classes, K001, K003, K005, KA04, KA16, KA22, KI04, KI14, and KI16, are selected according to [lessmeier2016condition] to assess the proposed method. All selected ORF and IRF faults are drawn from real bearing damage generated by accelerated lifetime testing. An accelerometer with a sampling frequency of 64 kHz collects the vibration data, and a 1024-data-point sliding window with no overlap divides the data. Finally, for each operating condition, 200 training samples and 50 test samples are provided. On the Paderborn dataset, the DSAGCN method is trained for 400 epochs.
IV-D2 Paderborn Fault Diagnosis: Results and Discussion
Fig. 9 depicts the simulation results of the proposed DSAGCN method and comparison methods for the unsupervised fault diagnosis issue for the Paderborn dataset under different working conditions. The gathered information can be divided into the following categories:

When evaluating average fault diagnosis accuracy, the proposed DSAGCN approach is the most accurate of all methods. Its average accuracy is 0.92%, 24%, 18.81%, and 40.61% higher than the best accuracy of each of the four comparison categories, respectively.

In one of the most challenging transfer tasks, G→E transfer, the DSAGCN method has 0.84% and 34.06% better accuracy than the maximum and minimum diagnosis accuracy in comparative methods, respectively. When there is a significant distribution gap among domains, this high accuracy indicates the effectiveness of the DSAGCN approach.

The reduced fault diagnosis accuracy in some tasks, such as F→E, compared to other tasks is explained by the change in working conditions and the more severe distribution differences between domains. In four tasks, the average accuracy of the DSAGCN approach is below 90%.

The mean accuracy of the DSAGCN method and other comparison methods for the Paderborn dataset is much lower than the results obtained for the CWRU dataset in Fig. 6. The suggested DSAGCN method’s average accuracy for the Paderborn dataset is 6.7% lower than the CWRU dataset due to the Paderborn dataset’s greater complexity than the CWRU and the significant distribution differences between domains in the Paderborn tasks. In the CWRU dataset, only the load changes, and the rotational speed is relatively constant. In contrast, in the Paderborn dataset, there is the possibility of changing the rotational speed and radial force, which causes more differences between distributions and reduces the accuracy of fault diagnosis under different operating conditions.
In addition to the numerical results, this part provides a visual t-SNE representation at five different layers: the input data, CNN layer, FC1 layer, TAGCN layer, and classifier layer. The H→G diagnostic task is randomly chosen, and its t-SNE is displayed in Fig. 10. As can be seen, the gap between the source and target domains shrinks toward the end layers, and data of the same class in the two domains move closer to each other in the latent space. Due to the domain adaptation modules acting on the end layers, the difference between the domains is reduced and the domains are better aligned. However, a small amount of data is wrongly placed near another class; for example, some K001 samples are mixed up with the K003 class, lowering the fault diagnosis accuracy of this task.
IV-E Model Discussion
IV-E1 Degree of the Graph's Polynomial Filter
One of the crucial parameters of the proposed approach is K, the polynomial degree of the graph filter in the DSAGCN model. In this subsection, values of K in {1, 2, 4, 5, 10, 25, 50, 100} are considered, and the ideal value is chosen by the criteria of highest accuracy and lowest training time. Table V reports the accuracy and single-epoch training time on the CWRU and PU datasets for the selected B→D and H→G tasks, respectively. The model is trained ten times for each value of K to reduce randomness, and the average accuracy and training time are reported. According to Table V, 100% accuracy on the CWRU task is achieved for K between 2 and 10; however, as K increases, the training time grows owing to the increasing complexity and computational burden, and diagnostic accuracy drops dramatically for K ≥ 50. On the PU task, the highest accuracy is obtained at K = 2; as K increases, the accuracy declines substantially, reaching 16.25% at K = 100, which is less than the accuracy of all comparison approaches. The optimal selection of K is therefore crucial. Based on the results on both datasets, K = 2 is the best choice for the proposed model, achieving the maximum accuracy with a short training time.
K                     1      2      4      5      10     25     50     100
CWRU  Accuracy (%)  70.44   100    100    100    100   99.8   83.4    66
CWRU  Time (s)       391    396    427    444    536    815   1273   2176
PU    Accuracy (%)  84.44  98.63  87.42  85.55  68.34  50.6  25.33  16.25
PU    Time (s)      1414   1458   1523   1603   2018   3071   4772   6023
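The role of the polynomial degree can be illustrated with a minimal sketch of a degree-K polynomial graph filter. This is a simplification: the actual TAGCN layer learns a weight matrix per hop, whereas the scalar weights below are hypothetical:

```python
import numpy as np

def polynomial_graph_filter(A, X, weights):
    """Apply a degree-K polynomial graph filter: Y = sum_{k=0}^{K} w_k * A_hat^k @ X,
    where A_hat is the symmetrically degree-normalized adjacency and
    K = len(weights) - 1. Larger K aggregates information from farther hops
    at a higher computational cost."""
    deg = A.sum(1)
    A_hat = A / np.sqrt(np.outer(deg, deg))   # D^{-1/2} A D^{-1/2}
    Y, P = np.zeros_like(X), X.copy()
    for w in weights:                          # hops 0, 1, ..., K
        Y += w * P
        P = A_hat @ P                          # propagate one more hop
    return Y

# 4-node cycle graph, one feature per node; K = 2 mixes up to 2-hop neighbors.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.array([[1.0], [0.0], [0.0], [0.0]])
Y = polynomial_graph_filter(A, X, weights=[0.5, 0.3, 0.2])
print(Y.ravel())  # node 0 keeps most signal; 1- and 2-hop neighbors receive the rest
```

With K = 2, each node's output depends only on its 2-hop neighborhood, matching the degree found optimal in Table V.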
IV-E2 Convergence Performance
Fig. 11 depicts the total loss and diagnostic accuracy per epoch on the test data of the B→D task (CWRU dataset) and the H→G task (PU dataset). In this figure, the proposed DSAGCN method is compared to GC-MMD, GC-MKMMD, and GC-CORAL to assess the effectiveness of the LMMD method. As the number of epochs rises, the total loss drops while the accuracy improves. Furthermore, the DSAGCN method converges to its minimum loss faster than the other approaches, indicating better convergence efficiency. The DSAGCN method also shows the smallest fluctuations and the smoothest curve, demonstrating its high stability. In conclusion, Fig. 11 demonstrates the superiority of the proposed approach and the strong performance of the LMMD technique when contrasted with CORAL, MMD, and MK-MMD.
IV-E3 Coefficients of the Loss Function
Six distinct values {0, 0.01, 0.05, 0.1, 0.5, 1} are used for the coefficients α and β to investigate the influence of the adversarial and LMMD loss terms on the fault diagnosis results. Fig. 12 shows an example for the B→D transfer task from the CWRU dataset. As can be observed, the model retains high accuracy across different values of α when β = 1, indicating the strong influence of the LMMD approach. The most optimal values are α = 0.5 and β = 1. For α = β = 0, the domain adaptation modules have no impact, and the model is equivalent to the Baseline method.
V Conclusion
This paper presents a novel end-to-end DSAGCN approach and a new framework for bearing fault diagnosis under diverse operating conditions based on crucial information such as data geometry. Adversarial domain adaptation and structured subdomain adaptation are employed to decrease the distribution discrepancy, and the geometric structure of the data is captured using graph theory. The proposed model's graph, which has a lower computational complexity than spectral methods and requires no linear approximation in its calculation, captures the geometric structure of the data under different working conditions and aids in diagnosing bearing faults. In parallel, the adversarial domain discriminator aligns the domains, and the LMMD loss function matches subdomains with the same class. Experimental results on the CWRU and Paderborn datasets demonstrate that the DSAGCN method outperforms the compared approaches in fault diagnosis accuracy under various operating conditions.