A common assumption in machine learning is that the training and the test data are sampled from the same distribution. However, in many practical scenarios the distribution of data samples in the test stage may deviate from that in the training phase. Domain adaptation methods aim to provide solutions to machine learning problems by dealing with this distribution discrepancy. In domain adaptation, a source domain and a target domain are considered where the label information is mostly available for the data samples in the source domain, and few or none of the class labels are known in the target domain. The purpose is then to improve the learning performance in the target domain by making use of the label information in the source domain as well as some presumed relation between the source and the target domains.
. Another common solution is to align the source and the target domains through feature space mappings. Many methods learn a classifier in a joint feature space, which may be obtained via projections or transformations[19, 22, 67]; kernel space representations [39, 21]; or deep networks [20, 30].
All these methods share the same property that they strictly depend on feature space representations of data. The common effort in these works is to successfully align the source and the target distributions via mappings or transformations based on certain techniques and assumptions. Such transformations and mappings are relatively easy to compute when the deviation between the source and the target distributions is small. On the other hand, when the source and the target distributions differ significantly, feature space methods often suffer from some performance loss as they get stuck in local optima or the intricacy of the actual transformation between the two domains is beyond the representation power of the transformation models they employ. Moreover, in certain machine learning problems involving classification or regression on platforms such as social networks, feature space representations may even not be available as data samples correspond to users or graph nodes, rather than vectors in an ambient space.
In this work, we propose a domain adaptation method that aims to overcome such shortcomings. The proposed method is purely based on graph representations of the source and the target domains. Except for a possible initialization stage of constructing the source and the target graphs from data (in case the graphs are not available beforehand), it does not use the feature space representations of data and treats the domain adaptation problem in a pure graph environment. In contrast to the aforementioned methods relying on feature space representations, graph methods provide considerable flexibility in challenging setups where the two feature spaces are hard to explicitly align due to high data dimensionality, the nonlinearity and irregularity of the actual warping between the two domains, large distribution discrepancy, etc.
In the proposed approach, the source and the target domains are represented respectively with a source graph and a target graph. We assume that many labels are available in the source domain, while few or no labels are available in the target domain. The computation of the class labels in the target domain is then cast as the estimation of an unknown label function on the target graph. Our assumption about the relation between the source and the target domains is that the variation of the class label function bears similar characteristics between the source and the target graphs: In a practical machine learning problem, different classes may have different characteristics regarding their localization, spread, and separability from the other classes. This is illustrated in Figure 1, where the green class has wider spread than and better separation from the other two classes, while the samples from the red and the blue classes tend to be more localized and entangled with each other. Then, assuming that such class-specific characteristics are common between the source and the target domains, our purpose is to obtain an accurate estimate of the class labels on the target graph by making use of this information.
Our solution is based on the idea of representing the source and the target label functions in terms of a set of localized signals on the two graphs. The extension of classical signal processing techniques to graph domains has been an emerging research topic of the recent years, during which graph equivalents of concepts like harmonic analysis and filtering have been proposed and gained wide recognition [46, 6, 4, 15]. In our work, we choose to represent class label functions in terms of graph wavelets . The spectral graph wavelets proposed in  inherit their distinguishing characteristics such as adjustable scale and localization from their counterparts in classical signal processing theory. Graph wavelets provide quite useful representations for our graph domain adaptation problem since we are interested in capturing class-specific variations of the label function in relation to e.g. how localized each class is and how fast the label function tends to change in different regions of the graph.
A mild assumption of our method is the availability of a small set of matches across the source and the target graphs, where a “match” refers to a source-target data sample pair belonging to the same class and known to be related in a particular way depending on the problem, such as originating from or being associated with similar data resources. For instance, in a text categorization problem where the source and the target domains consist of articles written in two different languages, a pair of text feature vectors representing the same article written in a source language and translated into a target language may constitute a “match”. Similarly, in an application where the source and the target domains contain respectively image and text samples, a “match” may consist of a pair of image and text feature vectors that are extracted from the same resource, such as the same web page. Although its reliance on the availability of a set of matches may seem to be a limitation of our framework, our algorithm often requires a small number of matches to attain satisfactory performance. Also, in many practical settings, even if such matches are not available beforehand, it is often possible and easy to form a small subset of such matches and inject them into the data set.
We formulate the connection between the source and the target domains via graph wavelets localized around the matched nodes across the two graphs, which serve as “anchor” points for characterizing and sharing the class-specific, local behaviors of the label functions over the source and the target graphs. The information related to the behavior of the source label function is transmitted to the target graph through its projection onto the graph wavelet functions as depicted in Figure 2. Our formal problem definition then becomes the following: Given a source graph and a target graph, we estimate the unknown labels such that the projections of the source and the target label functions onto the graph wavelets localized around matched graph nodes give similar coefficients. Employing also priors imposing the smoothness of the label functions and their consistency with the available labels, we formulate an optimization problem that is convex and quadratic in the label functions to be estimated, whose global minimum can be found analytically. Experimental results suggest that the proposed method performs successfully in domain adaptation problems and achieves state-of-the-art classification accuracy on various types of data sets.
The rest of the paper is organized as follows. In Section 2, we overview the related literature. In Section 3, we present a brief introduction to basic concepts in spectral graph theory, overview spectral graph wavelets, and then describe the proposed domain adaptation algorithm. In Section 4, we evaluate the performance of the proposed method with comparative experiments. We conclude in Section 5.
2 Related Work
The domain adaptation problem has been a well-studied topic of the recent years . Here, we give a brief overview of common approaches. The most basic form of domain adaptation is the covariate shift or sample selection bias problem, where the conditional distribution of the labels is assumed to be the same between the source and the target domains. Sample reweighing approaches can be successfully employed in this setting [27, 49]. In the more typical case where the conditional distributions are different, a relatively simple solution consists of mapping the source and the target data to a common high dimensional space via feature augmentation [13, 12, 14, 16]. Some works have extended the domain adaptation problem to settings with more than one source domain, where a target classifier is computed from the classifiers in the source domains [11, 62].
A quite prevalent approach in the literature is to align the source and the target domains using a transformation or a projection [19, 22, 67, 41, 29, 38, 18, 57, 59, 58]. In the assessment of the accuracy of aligning the two distributions, some widely used metrics are the maximum mean discrepancy [23, 61, 2, 39, 31, 21]; Wasserstein distance ; and distribution divergence measures . Some methods aim to directly match the densities or the covariances of the source and the target distributions in order to align the two domains [33, 48, 54], while others propose a solution based on learning a metric [64, 26] or sparse representations [50, 65]. In some works, the learning of a classifier is incorporated into the problem of learning a mapping [66, 5]. The recent method in  proposes to learn a classifier in the original data space via self-training. The extraction of domain-invariant features via deep networks has also been an active research topic of the last few years [20, 30, 60]
. A common strategy is to reduce the deviation between the source and the target features via adversarial learning, where the end classifiers are also often jointly optimized with the feature extraction layers[32, 55].
While many domain adaptation approaches rather focus on feature-space representations of data as discussed above, there are also methods that incorporate a graph model in the learning. Algorithms such as in  and  employ the assumption that the label function should be smooth on the data graph. However, there are much fewer examples of methods explicitly making use of graph bases or dictionaries for representing graph signals in classification problems as in our work, even outside the context of domain adaptation. The studies in [17, 43, 45] employ graph Fourier bases in solving multiview 3D shape analysis or clustering problems. The methods in [51, 53, 52] propose to use sparse signal representations on graphs via localized graph dictionaries, however in an unsupervised setting where the purpose is to reconstruct and approximate graph signals. Our preliminary study in , which proposes to represent label functions over graph Fourier bases, is a first attempt towards using graph signal processing techniques in domain adaptation problems. However, its utility is limited to settings where the source and the target graphs bear high resemblance, due to the difficulty of aligning and matching the Fourier bases of independently constructed source and target graphs.
3 Graph Domain Adaptation with Localized Signal Representations
In this section, we present our graph domain adaptation method. We first overview spectral graph theory and spectral graph wavelets in Sections 3.1 and 3.2. We then describe our domain adaptation algorithm based on transferring wavelet coefficients between the source and the target graphs in Section 3.3.
3.1 Overview of Spectral Graph Theory
Here, we briefly overview basic concepts from spectral graph theory  and graph signal processing . Throughout the paper, matrices and vectors are represented with uppercase and lowercase letters. Let be a graph consisting of vertices (nodes) in the vertex set and edges , where the matrix stores the edge weights. A graph signal is a function taking a real value on each graph node . A graph signal can equivalently be regarded as a vector in the -dimensional space, which we adopt in our notation.
The weight matrix is a symmetric matrix consisting of nonnegative edge weights, such that is the weight of the edge between the nodes and . If the nodes and are not connected with an edge, then . The weight matrix defines a diagonal degree matrix given by . The graph Laplacian is then defined as
which can be seen as an operator acting on a function via the matrix multiplication . The graph Laplacian
is of crucial importance in graph signal processing since it allows the extension of concepts such as Fourier transform and filtering to graph domains. Several previous studies have shown that the graph Laplaciancan be regarded as the graph equivalent of the well-known Laplace operator in the Euclidean domain, or the Laplace-Beltrami operator on manifold domains , , .
The Laplace operator in classical signal processing has the important property that complex exponentials
used in the definition of the Fourier transform are given by its eigenfunctions
In analogy with this property in classical signal processing, the eigenvectorsof the graph Laplacian satisfying
for are of special interest since they define a Fourier basis over graph domains . Indeed, for any graph, the eigenvector
corresponding to the smallest eigenvalueis always a constant function on the graph, while the speed of variation of on the graph increases for increasing . Hence, the eigenvalues of the graph Laplacian correspond to frequencies such that gives a measure of the speed of variation of the eigenvector regarded as a Fourier basis vector.
The definition of the graph Fourier basis allows the extension of the Fourier transform to graph domains as follows. Given a graph signal , its Fourier transform is defined as
such that the -th Fourier coefficient is given by the inner product of and the Fourier basis vector . Throughout the paper, stands for a frequency variable, and the Fourier transform is defined on the frequencies determined by the graph topology, as common convention in graph signal processing . The inverse Fourier transform similarly corresponds to the reconstruction of the graph signal as the sum of the Fourier basis vectors weighted by the Fourier coefficients
Based on these definitions, one can generalize the filtering operation to graph domains as well. Given a filter kernel specified in the spectral domain as a function of the frequency variable , an “input” graph signal can be filtered by multiplying its Fourier transform with the filter kernel as
where is the Fourier transform of the “output” signal after filtering. This spectral representation can then be transformed back to the vertex domain via the inverse Fourier transform in order to obtain the output signal as
3.2 Spectral Graph Wavelets
The wavelet transform is a widely used transform in many signal processing applications such as compression and reconstruction . In the recent years, several extensions of the wavelet transform have been proposed for graph domains [24, 8, 36]. In our graph domain adaptation problem, we would like to efficiently capture and transfer the local characteristics of label functions on graphs such as their vertex spread and speed of change. For this reason, in our work we prefer to use spectral graph wavelets  for representing label functions, which are theoretically shown to enjoy desirable properties such as good localization in the vertex domain and the spectral domain.
Spectral graph wavelets  are defined and characterized essentially in the spectral domain, based on the idea of extending the traditional wavelet transform to graphs. Considering a graph with nodes, the spectral graph wavelet transform is specified by a kernel in the spectral domain, which is a function of the frequency variable . The wavelet kernel represents a band-pass filter and acts on a graph signal
in the frequency domain via its Fourier transformas
This operation corresponds to band-pass filtering the signal to obtain a new signal . The graph signal can be reconstructed in the vertex domain by taking the inverse Fourier transform of as
where are the Fourier basis vectors.
This filtering operation is used for defining the spectral graph wavelets as follows. First, in order to define the graph wavelet transform at an arbitrary scale , the wavelet kernel is simply scaled as . Then, the spectral graph wavelet vector localized at node and having scale is obtained by band-pass filtering the Dirac delta function localized at node (which is the graph signal taking the value 1 at node and 0 elsewhere) using the filter kernel
Here denotes the filtering operation with kernel at scale ; and stands for the graph Fourier transform of the Dirac delta function as given by (2). Hence, the definition in (3) indicates how one can generate a collection of graph wavelet functions at different scales and localized at different graph nodes , where the exact structure of the wavelets are specified by the filter kernel , similarly to the way that wavelets are obtained from mother wavelet kernels in traditional signal processing. Having defined the band-pass wavelet functions based on the band-pass kernel , one can similarly start with a low-pass kernel and use it to define the scaling function localized at node , which is a low-pass graph signal just as in classical wavelet analysis . The spectral graph wavelets at different scales and the scaling function are demonstrated on two different graph topologies in Figure 3. It can be observed that the scaling function is a smooth and slowly varying function on both graphs, in line with the low-pass structure of the kernel . While both wavelets exhibit band-pass characteristics, has a faster variation and thus changes more abruptly than on both graphs, as its spectrum is shifted much more towards higher frequencies.
Finally, the wavelet coefficient of a signal at scale and localized at node is found by taking the inner product between and the wavelet . We refer the reader to  for a detailed analysis of the localization properties of graph wavelets and scaling functions and also on what constitutes a proper choice of the kernels and in order to ensure such properties. It is further discussed in  that, under an appropriate sampling of the scale parameter , the scaling function and the wavelets , form a frame. In our work, we use such a collection of graph signals, which consists of a low-pass scaling function and a couple of band-pass wavelets at different nodes. In the next section, we discuss how one can exploit these localized signal representations in order to transfer knowledge from one graph to another in domain adaptation.
3.3 Domain Adaptation with Spectral Graph Wavelets
We now focus on the problem of domain adaptation in a setting with a source graph and a target graph whose nodes represent the source data and the target data . In some settings, graphs are known and available beforehand, e.g. as in problems related to social networks, whereas in some other applications, the graphs are to be constructed from the data set. While the construction of the graphs is out of the scope of our work, it is often suitable to adopt common strategies such as choosing the neighbors with respect to the K-NN rule and setting the edge weights with a Gaussian kernel. Let and denote the source and the target graph Laplacian matrices obtained from the weight matrices and as in (1).
We consider a pair of class label functions and , respectively on the source and the target graphs. We address a setting where the class labels are largely known on the source graph, whereas few labels are known on the target graph. Let denote the label of the source node whenever is labeled, and let be defined similarly in the target domain. We assume that a small set of matched nodes between the source and the target graphs is available111An arbitrary indexing with the indices and is preferred in our notation since no relation is assumed to be known between the source and the target graphs apart from the matched node pairs and the node enumerations are assumed to be arbitrary.. These matched node pairs serve as anchor points in transferring the knowledge of the local behavior of the label functions and between the two graphs. In practice, these matched pairs may consist of data samples, e.g., extracted from the same resource or known to be related in a certain way.
Our method is based on the following idea: In a setting where the label functions and share similar local characteristics on the source and the target graphs, the wavelet coefficients of the label functions must also be similar. For a pair of matched nodes known to be related to each other, this assumption can be formulated as
at all scales . Here and respectively denote the wavelet at scale and the scaling function on the source graph localized at node , and similarly and denote those on the target graph localized at node . One may wonder about the validity of this assumption; i.e., whether the wavelet coefficients may indeed be expected to match in this way in practice, especially considering that the source and the target graphs are constructed independently. In order to examine this, the distribution of the wavelet coefficients are studied in an experiment in Figure 4. In this experiment, the source and the target domain samples are taken as face images from the MIT-CBCL  face data set captured under two different camera angles (frontal and profile). Each domain consists of a total of 360 images of 10 different participants viewed under varying lighting conditions. Some sample images from the source and the target domains are shown in Figure 3(a). The source and the target graphs are constructed with respect to the K-NN rule, using the samples in the source and the target domains respectively. The wavelet coefficients plotted in Figure 3(b) support our assumption and suggest that the projections of the source and target label functions onto the graph wavelets and scaling functions at the matched node pairs indeed bear high similarity, despite the fact that the two graphs are constructed independently.
Let and denote respectively the matrices whose columns consist of the source and the target scaling functions and wavelets for a sampling of the scale parameter at the matched node pair
In the determination of the wavelet functions, we adopt a logarithmic sampling of the scale parameter as proposed in . Let us also define the matrices and containing the scaling functions and wavelets at all matched node pairs as
In a setting where labels are completely available on the source graph, evaluating the graph wavelet coefficients of the source label function at the source matched nodes and transferring them to the target graph along their correspondences , one can reconstruct the target label function based on the transferred wavelet coefficients. If a reliable set of sufficiently many matches are available, one may expect to have a good reconstruction of on the target graph by solving the inverse problem
However, some possible challenges that might be encountered are the following. First, the number of matches might be limited in practice. Moreover, one may not always have a complete set of labels in the source domain, in which case the source wavelet coefficients cannot be computed. Due to these difficulties, we propose to estimate the label functions and by solving the following optimization problem
Here, and are label vectors consisting of the known labels and on the source and the target graphs, where and are the number of known labels on each graph. The matrices and are binary selection matrices consisting of ’s and ’s, which extract the labeled entries of and and enforce them to be in agreement with the known labels in the vectors and . The third term in (7) imposes the source wavelet coefficients to be similar to the target wavelet coefficients at the set of matched nodes. Finally, the last two terms serve as regularization terms that prevent the label function estimates and from varying too fast on the graphs, via the source and the target graph Laplacians and . The parameters , , and are nonnegative scalars weighting the contribution of each term.
In order to solve the optimization problem in (6), we first notice that the objective function is convex with respect to the optimization variables and , since the graph Laplacian matrices and are always positive semi-definite. We can then simply solve the problem by analytically setting the derivatives of with respect to and to 0:
Solving these two equations together, we get
This gives the estimates of the source and the target label functions and . We call the proposed algorithm Graph Domain Adaptation with Matched Local Projections (GrALP).
3.4 Complexity Analysis of the Algorithm
The complexity of the proposed method is analyzed in this section. First, assuming that the graph weight matrices and are known, the time complexities of computing the source and the target graph Laplacians and are of and . Noting that the complexities of calculating the eigenvalue decompositions of and are respectively of and , the computations of the source and the target wavelet matrices and have complexities of and .
From (10) we observe that the matrix can be computed with operations. Since the number of labeled samples is always smaller than or equal to , this complexity reduces to . The complexity of computing the matrix
in (9) can be found as . Since , this complexity reduces to .
Next, the matrix
in (9) can be calculated with additional operations, after which can be found with operations.
The overall complexity of finding the label functions and is then of . Since the number of matches is typically a small number, we may assume that is less than or . The overall complexity can then be simplified as .
4 Experimental Results
In this section, we first evaluate the performance of the proposed GrALP method with comparative experiments, and then analyze the sensitivity of the method to the selection of algorithm hyperparameters and wavelet kernels.
4.1 Comparative Experiments
We test the proposed method on four real data sets. The proposed GrALP algorithm is compared to the domain adaptation methods Subspace Alignment (SA) , Easy Adapt++ (EA++) , Geodesic Flow Kernel (GFK) , Joint Geometrical and Statistical Alignment (JGSA) , Scatter Component Analysis (SCA) , LDA-Inspired Domain Adaptation (LDADA) 68]. Among the domain adaptation methods, SA, GFK, and JGSA are based on aligning the source and the target domains with projections; EA++ is based on feature augmentation; LDADA learns a classifier in the original data domain; and the SCA method learns a discriminative representation in a reproducing kernel Hilbert space.
In all experiments, misclassification rates of the methods over the unlabeled target samples is evaluated under the following settings:
(Target sweep) The matches between the two domains are unlabeled and all other source samples are labeled. The ratio of the known target labels is varied.
(Source sweep) The matches are unlabeled and no label information is available in the target domain. The ratio of known source labels is varied.
(Unlabeled match sweep) The matches are unlabeled and no label information is available in the target domain. The ratio of matched nodes is varied.
(Partially labeled match sweep) A certain percentage of the matches and the source samples are labeled, and the unmatched target samples are unlabeled. The ratio of matched nodes is varied.
All methods are provided with all label information that leaks between the two domains through the matches. The SVM and the NN methods are trained over all labeled samples in the source and the target domains and the SSL method is applied on the target graph, which are the settings yielding the best performance. The SCA and the SSL algorithms require labeled samples in the target domain, therefore, are excluded from the experiments with no labeled target samples. One-hot label vectors are used in all methods that require an explicit representation of the label function. In data sets where the graphs are not readily available, a K-NN graph is constructed in each domain by connecting each sample to its nearest neighbors with respect to the Euclidean distance. The edge weights are set with a Gaussian kernel as . The parameters of the proposed GrALP method are selected as , , and in all experiments. The results obtained on different datasets are presented below.
MIT-CBCL face image data set. The MIT-CBCL data set  consists of images of 10 participants captured under varying poses and illumination conditions. The source and the target domains respectively consist of the frontal and the profile view images of the participants, with a total of 360 images in each domain. Some images from the data set are shown in Figure 3(a). Matched nodes consist of a pair of images of the same person captured under the same illumination conditions. The images are downsampled to a resolution of pixels. The source and the target graphs are constructed with nearest neighbors and Gaussian scale parameter . The results are averaged over 20 repetitions with random selections of matched samples and labeled samples.
The misclassification rate of the unlabeled target samples is plotted in percentage for all methods in Figure 5. The proposed GrALP method gives the best performance in Figures 4(a) and 4(b). The domain adaptation methods GrALP, LDADA, JGSA, GFK, and SA yield much higher classification accuracy than the other algorithms when a small set of target labels are available. As expected, domain adaptation algorithms perform better than basic classification methods. In the target sweep setting in Figure 4(a), the proposed GrALP method is observed to provide almost zero classification error, even for a very small percentage of target labels. Similarly, in the source sweep setting in Figure 4(b), the target error of GrALP drops quite quickly as the ratio of labeled source nodes starts increasing. GrALP can use source labels more effectively than the other methods as it employs the information of the matched node pairs. Considering that no label information is available in the target domain and the matches are also unlabeled in Figure 4(b), we conclude that the proposed GrALP algorithm is able to successfully transfer the label information from the source graph to the target graph through the wavelet coefficients over the matches. The misclassification rate in the target domain is observed to decrease down to although there is no label information in the target domain.
In the unlabeled match sweep setting in Figure 4(c), no labels are available in the target domain or on the matched nodes. Since the algorithms other than GrALP do not use the information of the matched nodes, their target misclassification rate is not affected by the number of matched nodes. As the ratio of matched nodes increases, the misclassification rate of the proposed GrALP algorithm decreases as expected, reaching zero misclassification rate when of the nodes are matched, despite the strict unavailability of labels in the target domain. While GrALP outperforms the other methods when there is a sufficient number of matches, we observe that the other methods perform better when the number of matches is too small. This is for the following reason: GrALP is a purely graph-based method that does not at all employ the ambient space representations of data samples once the source and the target graphs are constructed. Data samples are simply represented as abstract graph nodes and the only way the algorithm can link the source and the target domains is through the matched nodes. On the other hand, all the other methods (except for SSL) heavily rely on ambient space representations (feature vectors) of data samples. Having access to the physical coordinates of data unlike GrALP, they outperform GrALP when the number of matches is too few. The results of the partially labeled match sweep experiment in Figure 4(d) similarly show that the proposed method outperforms the others, provided that a sufficient number of matches (around ) is available.
COIL-20 object image data set. The COIL-20 data set  consists of a total of 1440 images of 20 objects. Each object has 72 images taken from different viewpoints rotating around it. We use this data set as follows for domain adaptation. The 20 objects in the data set are divided into two groups such that each object in the first group forms a pair with the object in the second group that is the most similar to it. The first and the second groups, each of which consist of the images of 10 different objects, are taken as the source domain and the target domain. Two objects that form a pair are considered to have the same class label. The 20 objects in the data set are shown in Figure 6. Matched nodes consist of images of paired objects captured under the same viewpoint. The images are downsampled to a resolution of pixels. The graphs are constructed with nearest neighbors and Gaussian scale parameter . The misclassification rates are averaged over random repetitions of the experiment.
The results of the experiment are presented in Figure 7. In Figures 6(a) and 6(b) the proposed GrALP method outperforms the other methods and yields quite high classification accuracy even for a very small number of labeled nodes. In Figure 6(c) the classification accuracy of GrALP exceeds that of the other methods as soon as the ratio of matched nodes attains when the matches are unlabeled, while its performance is seen to be even better in Figure 6(d) in case of partially labeled matches. The proposed method performs particularly well in this data set. Data samples are regularly sampled from the data manifold, resulting in an even and regular graph structure. This contributes positively to the accuracy of graph-based methods. One can indeed observe in Figure 6(a) that being a purely graph-based method, SSL also attains relatively high classification accuracy. The difference between the performances of GrALP and SSL is primarily due to the fact that, in addition to the labels known on the target graph, GrALP also exploits the information transferred from the source graph through the local wavelet coefficients.
Multilingual text data set. Next, we test the methods in a document categorization application. The multilingual text data set  contains 6 classes of documents interpreted in various languages. In particular, the overall data set consists of documents written originally in one language and translated into other languages. Documents are represented with bag-of-words feature vectors obtained using a TFIDF-based weighting scheme . The source and the target domains are taken respectively as English and French document sets, each of which contains a total of 1500 documents. A matched node pair consists of an English document and its translation into French. The dimension of feature vectors is reduced to 1000 with PCA as preprocessing. The graphs are constructed with
nearest neighbors and edge weights are computed using cosine similarity. The misclassification rates are averaged overrandom repetitions.
The results presented in Figure 8 show that the proposed GrALP method performs quite well in this data set. Even with a very small amount of matched nodes, the misclassification rate of GrALP is lower than that of the other methods. While the performances of the other domain adaptation methods consistently improve with the increase in the target labels in Figure 7(a), their performances improve slowly or stagnate with the increase in the source labels or matched nodes in Figures 7(b) and 7(d). We interpret this in the way that the bag-of-words feature representations of documents written in different languages are not easy to align by transformations or projections onto a common domain, therefore, the information available in the source domain cannot be not exploited efficiently. On the other hand, the graph-based GrALP algorithm performs better as it relies on relating the label information to the affinities between pairs of data samples in the same source graph and transmitting this information to a target graph of similar topology, instead of learning a classifier based on the ambient space representations of data samples.
Social network data set. Finally, we evaluate the methods on the Facebook data set , which consists of several subnetworks extracted from the Facebook network. Nodes represent Facebook users and edges indicate the friendship relationship between users in the graph of each network. We experiment on two networks representing different user communities with and users illustrated in Figure 9, which we take as the source and the target graphs. The two graphs contain 27 common users, which are deployed as matched nodes in our experiments. The data set contains information about users such as their education, work, or political affiliations. We have chosen the “gender” information as the label to be predicted, since it is provided for almost all users. The edge weight is taken as the constant value if a friendship relation exists between a pair of users.
Unlike in the previous data sets, data samples are not embedded in an ambient space in this social network data set, as each data sample represents a user. As the methods except GrALP and SSL require an ambient space representation of data as input, the data samples are mapped to the Euclidean space with the multidimensional scaling (MDS) algorithm  using the graph weight matrices. The dimension of the MDS embedding is empirically chosen as . The errors are averaged over random trials.
The misclassification errors of the methods in the target domain are presented in Figure 10. The balanced error rates are reported in these results in order to remove any bias due to the unequal presence of the two classes in the data set. Gender prediction on a social network graph is a challenging problem; nevertheless, the gender information seems to be implicitly encoded to some extent in the graphs of the two communities, which can also be observed by inspecting the variation of the label functions on the two graphs in Figure 9. The results in Figure 10 suggest that the proposed GrALP method may give promising results in this challenging setup. In particular, in Figure 9(a), GrALP performs better than the other methods when the ratio of available target labels is relatively low. In domain adaptation applications, typically no or few target labels are available, and the proposed method seems to effectively exploit the information in the source domain under such conditions. In Figure 9(b), GrALP is also observed to outperform the other methods under limited availability of the source labels. Figures 9(a) and 9(b) show that most of the methods have a correct classification rate fluctuating around in this binary classification problem, indicating that they cannot extract any information at all from labeled data samples. The comparison of the results in Figures 9(c) and 9(d) is particularly interesting. The classification accuracy of GrALP does not improve much with the increase in the number of matches in Figure 9(c) where the matched nodes are unlabeled, in contrast to its tendency to improve in Figure 9(d) where the matches are partially labeled. The knowledge of the labels at the matched nodes is seen to be critical in this data set unlike in the previous data sets. This can be explained with the fact that the label function (gender) has a much faster variation on the graphs in this data set compared to the previous ones, hence, the transfer of the projection coefficients onto the graph wavelets is more accurate when the wavelets are localized at labeled nodes, for instance, compared to the case where they are localized at some neighbor of a labeled node. The results in Figure 9(d) suggest that the proposed method can outperform the other domain adaptation methods if sufficiently many matches are available between the two graphs. The baseline SSL method performs particularly well at relatively large numbers of matches by diffusing in the target graph the label information leaking from the source graph through the labeled matches, while its misclassification rate is much higher when few matches are available.
4.2 Parameter Analysis of the Proposed Algorithm
We now analyze the sensitivity of the proposed GrALP algorithm to the choice of the algorithm parameters. First, we investigate the effect of the weight parameters , , and on the misclassification rate. The Multilingual text data set is used in this experiment. 90% of the source samples and 10% of the target samples are labeled and 10% of the graph nodes are matched. The target misclassification rates obtained with different combinations of , , and values are presented in Table 1. We observe that setting the parameters according to the rule of thumb yields good performance. This result is also confirmed on the MIT-CBCL and COIL-20 data sets.
Now, we analyze how the misclassification rate is influenced by the choice of the wavelet kernel types and the number of wavelet functions used in the representation of the label functions. The four different wavelet kernel types (AB spline, Mexican hat, Simple tight frame, Meyer) provided by the Spectral Graph Wavelets Toolbox (SGWT)  are tested in our experiments. The scaling and wavelet functions given by these wavelet kernel types are shown in Figure 11 for different number of wavelets.
The target misclassification rates obtained on the three data sets with these wavelet kernels are presented in Figure 12. 90% of the source samples are labeled, and 10% of the nodes are mathced, and no label information is available on the matched nodes or the target samples. Three different settings are tested with different combinations of the algorithm weight parameters. The magnitude of the Fourier coefficients of the label function is also plotted for each data set.
The results in Figures 11(a) and 11(b) show that target labels can be predicted with very high accuracy in the MIT-CBCL and COIL-20 data sets for the choice of the weight parameters as , , and . The wavelet kernel types and the number of wavelet functions does not affect the misclassification rate much in this setting. The amount of information transferred from the source graph in addition to the smoothing effect of the regularization term is enough to obtain very high classification performance in these two data sets. In the Multilingual data set, the classification error decreases as the number of wavelets increases as seen in Figure 11(c). The wavelet kernel type also affects the classification accuracy when the number of wavelets is small. The classification performance is better with the Simple tight frame or Meyer kernels, which are observed to have a better frequency coverage in Figure 11 compared to the AB Spline and Mexican hat kernels for a small number of wavelet functions. The AB spline and Mexican hat kernels have lower accuracy since these fail in representing certain parts of the spectrum, behaving like a band-stop filter if only one scaling function and one wavelet is used for instance.
In the second setting, the regularization terms are removed by choosing , in order to directly study the effect of the choice of the kernel type on the misclassification rate. An immediate observation in Figure 12 is that the performance degrades significantly in this second setting compared to the first one, which confirms the necessity of the regularization term. The results show that the misclassification rate first decreases and then increases in most data sets and kernel types in this setting. Using a too large number of wavelets degrades the performance for the following reason. As the number of wavelets increases, more and more wavelets capturing high frequency components of the label functions are involved in their representation, while transferring the high frequency information without regularization has an adverse effect as the high frequency part of the spectrum typically contains undesirable components such as noise or domain-specific variations. Examining the results for different data sets, we observe that AB spline and Mexican hat kernels perform better for the MIT-CBCL and COIL-20 data sets for a small number of wavelets. The plots in Figures 11(d) and 11(e) show that the spectra of the label functions are mostly concentrated at low frequencies in these two data sets, indicating that the label functions have a relatively slow variation on the graphs. The AB Spline and Mexican hat wavelets are more favorable in this case, as they are mostly concentrated around the low-frequency region of the spectrum when the number of wavelets is small. On the other hand, for the Multilingual data set, Meyer and Simple tight frame wavelets achieve better classification accuracy for a small number of wavelets. Due to the rather challenging structure of this data set, the label function varies relatively faster on the graphs, and has significant high-frequency components as seen in Figure 11(f). Consequently, the Meyer and Simple tight frame kernels perform better, as they cover the high-frequency part of the spectrum better than the AB Spline and the Mexican hat wavelets.
Finally, in the third setting the parameters are set and , in order to provide a comparison between our method and the reference solution that uses only the regularization term to predict the labels. As expected, this setting acts like a random classifier in all data sets, since there is no label information in the target domain and no information is transferred from the source domain in these experiments. A global conclusion of all these experimental results is that the effect of the wavelet kernel types and the number of wavelets can vary among different data sets. The kernels should be carefully selected, considering the task at hand, the properties of the label function, and possibly the graph topology.
We have presented a domain adaptation method for classification problems defined on graph domains. The proposed method is based on the idea of sharing and transferring the information of the local characteristics of the label function between a source graph and a target graph, by using the projection coefficients of the label function onto spectral graph wavelet functions. Unlike conventional domain adaptation approaches relying largely on representations in a feature space, the proposed algorithm has minimal dependence on the feature space properties of data and treats the problem in an abstract graph setting. This leads to a flexible data representation model turning out to be advantageous for problems that may be challenging to treat in the original data space. Experimental results on several real data sets demonstrate that the proposed graph-based method performs better than or competes with state-of-the-art domain adaptation approaches in most settings.
This work has been partly supported by the TÜBİTAK 2232 research scholarship under project number 117C007.
-  (2009) Learning from multiple partially observed views -an application to multilingual text categorization. In Proc. 22nd Int. Conf. Neural Information Processing Systems, NIPS’09, pp. 28–36. Cited by: §4.1.
Unsupervised domain adaptation by domain invariant projection.
IEEE International Conference on Computer Vision, pp. 769–776. Cited by: §2.
Domain adaptation on the statistical manifold.
IEEE Conference on Computer Vision and Pattern Recognition, pp. 2481–2488. Cited by: §2.
Geometric deep learning: going beyond euclidean data. IEEE Signal Process. Mag. 34 (4), pp. 18–42. Cited by: §1.
-  (2014) Semi-supervised domain adaptation on manifolds. IEEE Trans. Neural Netw. Learning Syst. 25 (12), pp. 2240–2249. Cited by: §2, §2.
-  (1996-12-03) Spectral Graph Theory (CBMS Regional Conference Series in Mathematics, No. 92). American Mathematical Society. Note: Paperback Cited by: §1.
-  (1997) Spectral graph theory. American Mathematical Society. Cited by: §3.1.
-  (2006) Diffusion wavelets. Applied and Computational Harmonic Analysis 21 (1), pp. 53 – 94. Note: Special Issue: Diffusion Maps and Wavelets External Links: Cited by: §3.2.
-  (2017) Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (9), pp. 1853–1865. Cited by: §2.
-  (2000) Multidimensional scaling. Chapman and hall/CRC. Cited by: §4.1.
-  (2008) Learning from multiple sources. Journal of Machine Learning Research 9, pp. 1757–1774. Cited by: §2.
-  (2010) Co-regularization based semi-supervised domain adaptation. In Proc. Advances in Neural Information Processing Systems 23, pp. 478–486. Cited by: §2.
-  (2007) Frustratingly easy domain adaptation. In Annual Meeting-Association for Computational Linguistics, Cited by: §2.
Frustratingly easy semi-supervised domain adaptation.
Proc. 2010 Workshop on Domain Adaptation for Natural Language Processing, pp. 53–59. Cited by: §2, §4.1.
-  (2019) Learning graphs from data: A signal representation perspective. IEEE Signal Process. Mag. 36 (3), pp. 44–63. Cited by: §1.
-  (2012) Learning with augmented features for heterogeneous domain adaptation. In Proc. 29th International Conference on Machine Learning, Cited by: §2.
-  (2015) Multimodal manifold analysis by simultaneous diagonalization of Laplacians. IEEE Trans. Pattern Anal. Mach. Intell. 37 (12), pp. 2505–2517. Cited by: §2.
Discriminative transfer learning on manifold. See DBLP:conf/sdm/2013, pp. 539–547. External Links: Cited by: §2.
-  (2013) Unsupervised visual domain adaptation using subspace alignment. In Proc. IEEE International Conference on Computer Vision, ICCV ’13, pp. 2960–2967. External Links: Cited by: §1, §2, §4.1.
Unsupervised domain adaptation by backpropagation. In Proc. 32nd International Conference on Machine Learning, pp. 1180–1189. Cited by: §1, §2.
-  (2017) Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE Trans. Pattern Anal. Mach. Intell. 39 (7), pp. 1414–1430. Cited by: §1, §2, §4.1.
-  (2012) Geodesic flow kernel for unsupervised domain adaptation. See DBLP:conf/cvpr/2012, pp. 2066–2073. Cited by: §1, §2, §4.1.
-  (2016) Domain adaptation with conditional transferable components. In Proc. 33nd International Conference on Machine Learning, pp. 2839–2848. Cited by: §2.
-  (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129 – 150. External Links: Cited by: §1, §3.2, §3.2, §3.2, §3.2, §3.3, §4.2.
-  (2005) From graphs to manifolds - weak and strong pointwise consistency of graph laplacians. Biologische Kybernetik, pp. 470–485. Cited by: §3.1.
-  (2017) Learning an invariant hilbert space for domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3956–3965. Cited by: §2.
-  (2006) Correcting sample selection bias by unlabeled data. In Proc. Advances in Neural Information Processing Systems 19, pp. 601–608. Cited by: §1, §2.
-  (2012) Learning to discover social circles in ego networks. In Advances in neural information processing systems, pp. 539–547. Cited by: Figure 3, §4.1.
-  (2019) Exploring uncertainty in pseudo-label guided unsupervised domain adaptation. Pattern Recognition 96, pp. 106996. External Links: Cited by: §2.
-  (2015) Learning transferable features with deep adaptation networks. In Proc. 32nd International Conference on Machine Learning, pp. 97–105. Cited by: §1, §2.
-  (2014) Adaptation regularization: A general framework for transfer learning. IEEE Trans. Knowl. Data Eng. 26 (5), pp. 1076–1089. Cited by: §2.
-  (2017) Deep transfer learning with joint adaptation networks. In Proc. 34th International Conference on Machine Learning, pp. 2208–2217. Cited by: §2.
-  (2012) Semi-supervised domain adaptation with non-parametric copulas. In Proc. Advances in Neural Information Processing Systems 25, pp. 674–682. Cited by: §2.
-  (2018) An embarrassingly simple approach to visual domain adaptation. IEEE Trans. Image Processing 27 (7), pp. 3403–3417. Cited by: §2, §4.1.
MIT-CBCL face recognition database. Note: Available: http://cbcl.mit.edu/software-datasets/heisele/facerecognition-database.html Cited by: §3.3, §4.1.
-  (2012) Perfect reconstruction two-channel wavelet filter banks for graph structured data. IEEE Trans. Signal Processing 60 (6), pp. 2786–2799. Cited by: §3.2.
-  (1996-02) Columbia object image library (COIL-20). Technical report Technical Report CUCS-005-96, Department of Computer Science, Columbia University. Cited by: §4.1.
Transfer learning via dimensionality reduction.
Proc. Twenty-Third AAAI Conference on Artificial Intelligence, pp. 677–682. Cited by: §2.
Domain adaptation via transfer component analysis.
IEEE Trans. Neural Networks22 (2), pp. 199–210. Cited by: §1, §2.
-  (2010) A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22 (10), pp. 1345–1359. Cited by: §2.
-  (2018) Semi-supervised transfer subspace for domain adaptation. Pattern Recognition 75, pp. 235 – 249. External Links: Cited by: §2.
-  (2016) Domain adaptation via transferring spectral properties of label functions on graphs. See DBLP:conf/ivmsp/2016, pp. 1–5. Cited by: §2.
-  (2013) Sparse modeling of intrinsic correspondences. Comput. Graph. Forum 32 (2), pp. 459–468. Cited by: §2.
-  (2003) Using TF-IDF to determine word relevance in document queries. In Proc. 1st Instructional Conference on Machine Learning, Cited by: §4.1.
-  (2017) Partial functional correspondence. Comput. Graph. Forum 36 (1), pp. 222–236. Cited by: §2.
The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30 (3), pp. 83–98. Cited by: §1, §3.1, §3.1, §3.1, §3.1.
-  (2006) From graph to manifold laplacian: the convergence rate. Applied and Computational Harmonic Analysis 21 (1), pp. 128 – 134. Note: Special Issue: Diffusion Maps and Wavelets External Links: Cited by: §3.1.
-  (2016) Return of frustratingly easy domain adaptation. In Proc. Thirtieth AAAI Conference on Artificial Intelligence, pp. 2058–2065. Cited by: §2.
-  (2011) A two-stage weighting framework for multi-source domain adaptation. In Proc. Advances in Neural Information Processing Systems 24, pp. 505–513. Cited by: §1, §2.
-  (2014) Sparsity regularization label propagation for domain adaptation learning. Neurocomputing 139, pp. 202 – 219. External Links: Cited by: §2.
-  (2015) Multi-graph learning of spectral graph dictionaries. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3397–3401. Cited by: §2.
-  (2018) Learning of robust spectral graph dictionaries for distributed processing. EURASIP J. Adv. Sig. Proc. 2018, pp. 67. Cited by: §2.
-  (2014) Learning parametric dictionaries for signals on graphs. IEEE Trans. Signal Processing 62 (15), pp. 3849–3862. Cited by: §2.
-  (2017) Unsupervised domain adaptation with copula models. In 27th IEEE Int. Workshop on Machine Learning for Signal Processing, pp. 1–6. Cited by: §2.
-  (2017) Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2962–2971. Cited by: §2.
-  (1995) Wavelets and subband coding. Prentice-Hall, Inc., Upper Saddle River, NJ, USA. External Links: Cited by: §3.2.
-  (2008) Manifold alignment using procrustes analysis. See DBLP:conf/icml/2008, pp. 1120–1127. Cited by: §2.
-  (2011) Heterogeneous domain adaptation using manifold alignment. See DBLP:conf/ijcai/2011, pp. 1541–1546. Cited by: §2.
-  (2010) A geometric framework for transfer learning using manifold alignment. Ph.D. Thesis. Cited by: §2.
-  (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135 – 153. External Links: Cited by: §2.
-  (2019) Semi-supervised domain adaptation via fredholm integral based kernel methods. Pattern Recognition 85, pp. 185 – 197. External Links: Cited by: §2.
-  (2017) Online transfer learning with multiple homogeneous or heterogeneous sources. IEEE Trans. Knowl. Data Eng. 29 (7), pp. 1494–1507. Cited by: §2.
-  (2015) Feature space independent semi-supervised domain adaptation via kernel matching. IEEE Trans. Pattern Anal. Mach. Intell. 37 (1), pp. 54–66. Cited by: §2.
-  (2017) A unified framework for metric transfer learning. IEEE Trans. Knowl. Data Eng. 29 (6), pp. 1158–1171. Cited by: §2.
-  (2018) Learning domain-shared group-sparse representation for unsupervised domain adaptation. Pattern Recognition 81, pp. 615 – 632. External Links: Cited by: §2.
-  (2015) Semi-supervised domain adaptation with subspace learning for visual recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2142–2150. Cited by: §2.
-  (2017) Joint geometrical and statistical alignment for visual domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5150–5158. Cited by: §1, §2, §4.1.
-  (2003) Semi-supervised learning using gaussian fields and harmonic functions. See DBLP:conf/icml/2003, pp. 912–919. Cited by: §4.1.