## I Introduction

In the field of artificial intelligence (AI), and particularly in machine learning, storing the knowledge learned by solving one problem and applying it to another similar problem is very challenging. For example, the knowledge gained from recognizing cars could be used to help recognize trucks, value predictions for US real estate could help predict real estate values in Australia, or knowledge learned from classifying English documents could be used to help classify Spanish documents. As such, transfer learning models [1, 2, 3] have received tremendous attention from scholars in object recognition [4, 5, 6, 7], AI planning [8][9, 10, 11], recommender systems [12, 13], and natural language processing [14]. Compared to traditional single-domain machine learning models, transfer learning models have clear advantages: 1) the knowledge learned in one domain (the source domain) can help improve prediction accuracy in another domain (the target domain), particularly when the target domain has scant data [15]; and 2) knowledge from a labeled domain can help predict labels for an unlabeled domain, which may avoid a costly human labeling process [16]. Of the proposed transfer learning models, domain adaptation models have demonstrated good success in various practical applications in recent years [17, 18].

Most domain adaptation models focus on homogeneous unsupervised domain adaptation (HoUDA); that is, where the source and target domains have the same, or very similar, feature spaces and there are no labeled instances in the target domain. However, given the time and cost associated with human labeling, many target domains are heterogeneous and unlabeled, which means most existing HoUDA models do not perform well on the majority of target domains. Current heterogeneous unsupervised domain adaptation (HeUDA) models need parallel sets to bridge two heterogeneous domains, i.e., some identical or very similar instances must exist in both heterogeneous domains. However, if the domains are confidential and private, finding similar instances between the two domains is not possible; e.g., credit assessment data is confidential and private, so the information of each instance cannot be accessed. To the best of our knowledge, the HeUDA setting without a parallel set has rarely been discussed. This gap limits the scenarios in which domain adaptation models can be used.

The aim of this research is to fill this gap by establishing a foundation for HeUDA models that predict labels for a heterogeneous and unlabeled target domain without parallel sets. This work is motivated by the observation that two heterogeneous domains may come from one domain, which means that two heterogeneous domains could be outputs of heterogeneous maps of this domain. For example, sentences written in Latin can be translated into sentences written in French and Spanish. These French and Spanish sentences have different representations but share a similar meaning. If the Latin sentences are labeled as “positive”, then the French and Spanish sentences are probably labeled as “positive” too. In this example, the domain consisting of French sentences and the domain consisting of Spanish sentences come from the Latin domain. Based on this observation, we formalize the heterogeneous unsupervised domain adaptation problem and present two key factors that reveal the similarity between two heterogeneous domains:

1) the variation factor, i.e., the variation between the conditional probability density functions of both domains; and

2) the distance factor, i.e., the distance between the feature spaces of the two domains.

Then, homogeneous representations are constructed via preserving the original similarity between two heterogeneous domains, while allowing knowledge to be transferred. In general, a small variation factor means that two domains have similar ground-truth labeling functions, and a small distance factor means that two feature spaces are close. We denote , , and by values of and of the original heterogeneous () domains and the homogeneous () representations. The basic assumption of unsupervised domain adaptation models is that two domains have similar ground-truth labeling functions. Hence, the constructed homogeneous representations must make . Similarly, is expected, indicating that the distance between two feature spaces of the homogeneous representations is small. This paper mainly focuses on how to construct the homogeneous representations where and (the exact homogeneous representations of two heterogeneous domains). For the representations, HoUDA models can be applied to transfer knowledge across them (HoUDA models can minimize and ).

The unsupervised knowledge transfer theorem sets out the transfer conditions necessary to prevent negative transfer (to make ). Linear monotonic maps (LMMs) meet the transfer conditions of the theorem and, hence, are used to construct the homogeneous representations. Rather than directly measuring the distance between two heterogeneous feature spaces, the distance between two feature subspaces of different dimensions is measured using the principal angles of a Grassmann manifold. This new distance metric reflects the extent to which the homogeneous representations have preserved the geometric relationship between the original heterogeneous domains (to make ). It is defined on two pairs of subspace sets; one pair of subspace sets reflects the original domains, the other reflects the homogeneous representations. Homogeneous representations of the heterogeneous domains are constructed by minimizing the distance metric based on the constraints associated with LMMs, i.e., minimize under the constraints . Knowledge is transferred between the domains through the homogeneous representations via a geodesic flow kernel (GFK) [4]. The complete HeUDA model resulting from this research incorporates all these elements and is called the Grassmann-LMM-GFK model - GLG for short. To validate GLG’s efficacy, five public datasets were reorganized into ten tasks across three applications: cancer detection, credit assessment, and text classification. The experimental results reveal that the proposed model can reliably transfer knowledge across two heterogeneous domains when the target domain is unlabeled and there is no parallel set. The main contributions of this paper are:

1) an effective heterogeneous unsupervised domain adaptation model, called GLG, that is able to transfer knowledge from a source domain to an unlabeled target domain in settings where both domains have heterogeneous feature spaces and are free of parallel sets;

2) an unsupervised knowledge transfer theorem that prevents negative transfer for HeUDA models; and

3) a new principal angle based metric that shows the extent to which homogeneous representations have preserved the geometric distance between the original domains, and reveals the relationship between two heterogeneous feature spaces.

This paper is organized as follows. Section II includes a review of the representative domain adaptation models. Section III introduces the GLG model, and its optimization process is presented in Section IV. Section V describes the experiments conducted to test the model’s effectiveness. Section VI concludes the paper and discusses future work.

## II Related work

This section reviews homogeneous unsupervised domain adaptation models and heterogeneous domain adaptation models, which are the most closely related to our work. GLG is then compared to these models.

### II-A Homogeneous unsupervised domain adaptation

HoUDA is the most popular research topic, with three main techniques for transferring knowledge across domains: the Grassmann manifold method [4, 16, 19, 20], the two-sample test method [21, 22, 23, 24, 25, 26], and the graph matching method [18]. GFK seeks the best of all subspaces between the source and target domains, using the geodesic flow of a Grassmann manifold to find latent spaces through integration [4]. Subspace alignment (SA) [19] maps a source PCA subspace into a new subspace that is well-aligned with the target subspace. Correlation alignment (CORAL) [27] matches the covariance matrices of the source and target subspaces. Transfer component analysis (TCA) [28] applies maximum mean discrepancy (MMD) [29] to measure the distance between the source and target feature spaces and optimizes this distance to bring the two domains closer together. Information-theoretical learning (ITL) [30] identifies feature spaces where data in the source and the target domains are similarly distributed. ITL also learns feature spaces discriminatively so as to optimize an information-theoretic metric as a proxy to the expected misclassification errors in the target domain. Joint distribution adaptation (JDA) [31] improves TCA by jointly matching marginal distributions and conditional distributions. Scatter component analysis (SCA) [17] extends TCA and JDA, and considers the between-class and within-class scatter. Wasserstein distance guided representation learning (WDGRL) [32] minimizes the distribution discrepancy by employing the Wasserstein distance in neural networks. Deep adaptation networks (DAN) [33] and joint adaptation networks (JAN) [34] employ MMD and deep neural networks to learn the best representations of two domains.

### II-B Heterogeneous domain adaptation

There are three types of heterogeneous domain adaptation models: heterogeneous supervised domain adaptation (HeSDA), heterogeneous semi-supervised domain adaptation (HeSSDA), and heterogeneous unsupervised domain adaptation (HeUDA).

HeSDA and HeSSDA aim to transfer knowledge from a source domain to a heterogeneous target domain, where the two domains have different features. There is less literature on this setting than on homogeneous situations. The main models are heterogeneous spectral mapping (HeMap) [35], manifold alignment-based models (MA) [36], asymmetric regularized cross-domain transformation (ARC-t) [37], heterogeneous feature augmentation (HFA) [38], co-regularized online transfer learning [14], semi-supervised kernel matching for domain adaptation (SSKMDA) [39], the DASH-N model [40], the discriminative correlation subspace model [41], and semi-supervised entropic Gromov-Wasserstein discrepancy [42].

Of these models, ARC-t, HFA and co-regularized online transfer learning only use labeled instances in both domains; the other models are able to use unlabeled instances to train models. HeMap works by using spectral embedding to unify different feature spaces across the target and source domains, even when the feature spaces are completely different [35]. Manifold alignment derives its mapping by dividing the mapped instances into different categories according to the original observations [36]. SSKMDA maps the target domain points to similar source domain points by matching the target kernel matrix to a submatrix of the source kernel matrix based on a Hilbert-Schmidt independence criterion [39]. DASH-N jointly learns a hierarchy of features combined with transformations that rectify any mismatches between the domains and has been successful in object recognition [40]. The discriminative correlation subspace model [41] finds the optimal discriminative correlation subspace for the source and target domains. The model in [42] is a HeSSDA model that exploits the theory of optimal transport, a powerful tool originally designed for aligning two different distributions.

Unsupervised domain adaptation models based on homogeneous feature spaces have been widely researched. However, HeUDA models are rarely studied due to two shortcomings of current domain adaptation models: the feature spaces must be homogeneous, and there must be at least some labeled instances in the target domain (or there must be a parallel set in both domains). The hybrid heterogeneous transfer learning model [43] uses the information of the parallel set of both domains to transfer knowledge across domains. Domain specific feature transfer [44] is designed to address the HeUDA problem when two domains have common features. Kernel canonical correlation analysis (KCCA) [45] was proposed to address HeUDA problems when there are paired instances in the source and target domains, but this model is not valid when there are no paired instances in the two domains.

### II-C Comparison to related work

The scatter component analysis (SCA) model, an example of existing HoUDA models, incorporates a fast representation learning algorithm for unsupervised domain adaptation. However, this model can only transfer knowledge across homogeneous domains.

The SSKMDA model, an example of existing HeSSDA models, uses a kernel matching method to transfer knowledge across heterogeneous domains. However, again, this model relies on labeled instances in the target domain to help correctly measure the similarity between the heterogeneous feature spaces (i.e., to measure the two factors introduced in Section I). GLG relies on the unsupervised knowledge transfer theorem to maintain the variation factor and on the principal angles of a Grassmann manifold to measure the distance between two heterogeneous feature spaces. Therefore, GLG does not require any labeled instances. A metric based on principal angles reflects the extent to which the homogeneous representations have preserved the geometric distance between the original heterogeneous domains. Knowledge is successfully transferred across heterogeneous domains by minimizing this metric under the conditions of the unsupervised knowledge transfer theorem.

Existing HeUDA models, e.g., the kernel canonical correlation analysis (KCCA) model, can transfer knowledge between two heterogeneous domains when both domains have paired instances and the target domain is unlabeled. However, these models are invalid when no paired instances exist. GLG is designed to transfer knowledge without needing paired instances and is based on a theorem that prevents negative transfer.

These demonstrations fill some foundational gaps in the field of unsupervised domain adaptation, which will hopefully lead to further research advancements.

## III Heterogeneous unsupervised domain adaptation

Our HeUDA model, called GLG, is built around an unsupervised knowledge transfer theorem that avoids negative transfer through a variation factor that measures the difference between the conditional probability density functions in both domains. The unsupervised knowledge transfer theorem guarantees linear monotonic maps (LMMs) against negative transfer once used to construct homogeneous representations of the heterogeneous domains (because ). A metric, which reflects the distance between the original domains and the homogeneous representations, ensures the distance factor between the original domains is preserved (). Thus, the central premise of the GLG model is to find the best LMM such that the distance between the original domains is preserved.

### III-A Problem statement and notation settings

Following our motivation (two heterogeneous domains may come from one domain), we first give a distribution over a random multivariable defined on an instance set , and a target function . The value of corresponds to the probability that the label of is 1. In this paper, we use to represent , where is the label of and . The random multivariables of feature spaces of two heterogeneous domains are heterogeneous images of :

(1) |

where , , and . In the heterogeneous unsupervised domain adaptation setting, and we can observe a source domain and a target domain , where , are observations of the random multivariables and , respectively, and , taking value from , is the label of . builds up a features space of and builds up a features space of and builds up of a label space of . In following, and for short. HeUDA problem is how to use knowledge from to label each in .

### III-B Unsupervised knowledge transfer theorem for HeUDA

This subsection first presents relations between and (or ) and then gives the definition of the variation factor () between and . Through the definition of , we give an unsupervised knowledge transfer theorem for HeUDA. Given a measurable subset , we can obtain the probability . So, we expect that the probability and are around . If is regarded as Latin sentences mentioned in Section I, and are representations of the French and Spanish sentences translated from the Latin sentences. If the Latin sentences are labeled by ”positive” (), we of course expect that the French and Spanish sentences have high probabilities to be labeled by ”positive”. Formally, we assume a following equality holds.

(2) |

where and are two real-value functions. Since two heterogeneous domains have a similar task (i.e., labeling sentences as “positive” or “negative”), we know and should be around 1 and have following properties for any .

(3) |

The properties described in (3) guarantee that two heterogeneous domains are similar. For example, if , we will have , indicating that positive Latin sentences are represented by negative French sentences. Based on (2), we define the variation factor as follows.

(4) |

Then, the definition of the heterogeneous unsupervised domain adaptation condition follows. Satisfying this condition indicates that the knowledge has been transferred in the expected way.

###### Definition 1 (HeUDA condition).

Given , and the equality (2), if there are two maps and , then, , the heterogeneous unsupervised domain adaptation condition can be expressed by following equality.

(5) |

where a measurable set.

If this condition is satisfied, it is clear

and

indicating that and will not cause extreme negative transfer and .

Although Definition 1 provides the basic transfer condition in HeUDA scenarios, some properties of and still need to be explored to determine which kind of maps satisfy this condition. Monotonic maps are one such map, defined as:

###### Definition 2 (monotonic map).

If a map satisfies the following condition

where and are binary relations and “” is a strict partial order over and , then the map is a monotonic map.

Based on Definition 2, the proposed unsupervised knowledge transfer theorem follows.

###### Theorem 1 (unsupervised knowledge transfer theorem).

Given , and the equality (2), if there are two maps and satisfy that

1) and are monotonic maps;

2) and ;

then and satisfy the heterogeneous unsupervised domain adaptation conditions.

###### Proof.

For simplicity, we let

and for short. Based on the equality (2), we have

Let and . Since and , we have

and

Because is a monotonic map, there must be a 1-1 map between and , that is,

Hence, we arrive at following equation.

That is,

So, we have

and this theorem is proven. ∎

Based on Theorem 1, we demonstrate some properties of and and highlight the sufficient conditions for reliable unsupervised knowledge transfer. If a mapping function from heterogeneous domains to homogeneous representations satisfies the two conditions mentioned in Theorem 1, it can transfer knowledge across domains with theoretical reliability.

### III-C Principal angle based measurement between heterogeneous feature spaces

This subsection introduces the method for measuring the distance between two subspaces. On a Grassmann manifold (or ), subspaces with (or ) dimensions of are regarded as points in (or ). This means that the distance between two subspaces can be calculated as the distance between the two corresponding points on the Grassmann manifold. First, the subspaces spanned by and are obtained using singular value decomposition (SVD). Then, the distance between the spanned subspaces and can be calculated in terms of the corresponding points on the Grassmann manifold.

There are two HoUDA models that use a Grassmann manifold in this way: DAGM and GFK. DAGM was proposed by Gopalan et al. [16], and GFK was proposed by Gong et al. [4]. Both share one shortcoming: the source domain and the target domain must have feature spaces of the same dimension, mainly due to the lack of a geodesic flow on and (). In [46], Ye and Lim proposed principal angles between two subspaces of different dimensions, which helps us to measure the distance between two heterogeneous feature spaces consisting of and . Principal angles for heterogeneous subspaces are defined as follows.

###### Definition 3 (principal angles for heterogeneous subspaces [46]).

Given two subspaces and (), which form the matrices and , the principal vectors , are defined as solutions for the optimization problem:

Then, the principal angles for heterogeneous subspaces are defined as

Ye and Lim [46] prove that the optimization solution for Definition 3 can be computed using SVD. Thus, we can calculate the principal angles between subspaces of different dimensions, and this idea forms the distance factor mentioned in Section I. To define distances between subspaces of different dimensions rigorously, Ye and Lim use two Schubert varieties to prove that all the distances defined for subspaces of the same dimension remain valid when the dimensionalities differ. This means we can calculate the distances between two subspaces of different dimensions using the principal angles defined in Definition 3. Given and , the distance vector between and is defined as the cosine values of the principal angles between and , which has the following expression.

where , is the singular value of computed by SVD (the principal angles ).
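The SVD-based computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `column_space_basis` and `principal_angle_cosines` are hypothetical, and the sketch assumes that the two data matrices have the same number of rows, so that their column spaces live in the same ambient space while having different dimensions.

```python
import numpy as np


def column_space_basis(X):
    """Orthonormal basis (as columns) of the subspace spanned by the columns of X.

    Assumes X has full column rank; the left singular vectors from a thin SVD
    then span the same subspace as the columns of X.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U


def principal_angle_cosines(A, B):
    """Cosines of the principal angles between span(A) and span(B).

    A and B must have orthonormal columns in the same ambient space, but may
    have different numbers of columns. The singular values of A^T B are the
    cosines, returned in descending order (min(k, l) values).
    """
    sigma = np.linalg.svd(A.T @ B, compute_uv=False)
    # Guard against round-off pushing values slightly outside [0, 1].
    return np.clip(sigma, 0.0, 1.0)


# Toy example: a 2-dimensional and a 1-dimensional subspace of R^3.
plane = column_space_basis(np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]))
line = column_space_basis(np.array([[1.0], [1.0], [0.0]]))
cosines = principal_angle_cosines(plane, line)
angles = np.arccos(cosines)  # principal angles in radians
```

Because the line in this toy example lies inside the plane, the single principal angle is zero. Subspaces of unequal dimension need no padding, since the SVD of the rectangular matrix `A.T @ B` yields exactly as many cosines as the smaller dimension.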

If we can find two maps and satisfying conditions of Theorem 1, we can obtain the as following.

where and . Hence, we can measure distance between and via these singular values.

### III-D The proposed HeUDA model

With the unsupervised knowledge transfer theorem defined, which ensures the reliability of heterogeneous unsupervised domain adaptation, and with the principal angles of Grassmann manifolds explained, we now turn to the proposed model, GLG. The optimization solution for GLG is outlined in Section IV.

A common idea for finding the homogeneous representations of heterogeneous domains is to find maps that can project feature spaces of different dimensions (heterogeneous domains) onto feature spaces with the same dimension. However, most models require at least some instances in the target domain to be labeled to maintain the relationship between the source and target domains. Thus, the key to an HeUDA model is to find a few properties that can be maintained between the original domains and the homogeneous representations. Here, these two factors are the variation factor ( and ) and the distance factor ( and ) defined in previous subsections. Theorem 1 determines the properties the maps should satisfy to make , and the principal angles show the distance between two heterogeneous (or homogeneous) feature spaces ( and ), but there are still two concerns: 1) which type of mapping function is suitable for Theorem 1; and 2) which properties the map should maintain between the original domains and the homogeneous representations. The first concern with the unsupervised knowledge transfer theorem is addressed by selecting LMMs as the map of choice.

###### Lemma 1 (linear monotonic map).

Given a map with form , is a monotonic map if and only if or , where and .

###### Proof.

, without loss of generality, we assume (). Because , we have

So,

Because and and are any vector in satisfying , if and only if . We can simply prove the is a decreasing monotonic map if and only if . ∎
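The statement of Lemma 1 can be checked numerically. The sketch below is only an illustration under stated assumptions: it assumes the linear map has the form f(x) = x Uᵀ and that the strict partial order is the componentwise order on vectors; the names `linear_map` and `is_order_preserving` are hypothetical and do not come from the paper.

```python
import numpy as np


def linear_map(X, U):
    """Apply f(x) = x U^T to every row of X (assumed form of the map).

    X has shape (N, m) and U has shape (r, m); the result has shape (N, r).
    """
    return X @ U.T


def is_order_preserving(X_low, X_high, U):
    """True if f(x) < f(y) componentwise for every paired rows x, y.

    The caller is expected to pass pairs with x < y componentwise.
    """
    return bool(np.all(linear_map(X_low, U) < linear_map(X_high, U)))


rng = np.random.default_rng(0)
m, r, n_pairs = 5, 3, 200

# Pairs (x, y) with x < y in every coordinate.
X_low = rng.normal(size=(n_pairs, m))
X_high = X_low + rng.uniform(0.1, 1.0, size=(n_pairs, m))

U_pos = rng.uniform(0.1, 1.0, size=(r, m))  # all entries positive
U_mixed = U_pos.copy()
U_mixed[0] = -U_mixed[0]  # first output coordinate now decreases

preserved_pos = is_order_preserving(X_low, X_high, U_pos)
preserved_mixed = is_order_preserving(X_low, X_high, U_mixed)
```

With an all-positive matrix, every output coordinate is a positive combination of strictly increasing inputs, so the order is preserved. Flipping the sign of a single row makes the corresponding output coordinate decreasing, which breaks monotonicity of the map as a whole.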

Since the defined map in Lemma 1 only uses and according to the generalized inverse of matrix, a matrix must satisfy . Therefore, we can prove that LMMs satisfy the conditions in Theorem 2.

###### Theorem 2 (LMM for HeUDA).

Given , and the equality (2), if there are two maps and are LMMs, then and satisfy the HeUDA condition.

###### Proof.

###### Remark 1.

From this theorem and the nature of LMMs, we know this positive map can better handle datasets that have many monotonic samples because the probabilities in these monotonic samples can be preserved without any loss. The existence of these samples has the greatest probability of preventing negative transfers.

Theorem 2 addresses the first concern and provides a suitable map, such as the map in Lemma 1, to project two heterogeneous feature spaces onto the same dimensional feature space. It is worthwhile showing that an LMM is just one among many suitable maps for Theorem 1. A nonlinear map, such as exp(), also can be used to construct the map as long as the map is monotonic. In future work, we intend to explore additional maps suitable for further HeUDA models.

This brings us to the second concern: Which properties can be maintained during the mapping process between the original domains and the homogeneous representations. As mentioned above, the principal angles play a significant role in defining the distance between two subspaces on a Grassmann manifold, and in explaining the projections between them [47]. Hence, ensuring the principal angles remain unchanged is one option for maintaining some useful properties. Specifically, for any two pairs of subspaces () and (), if the principal angles of () and () are the same (implying that min{dim(), dim()}= min{dim(), dim()}, dim() represents the dimension of ), then relationship between and can be regarded as similar to the relationship between and . Based on this idea, the definition of measurement , which describes the relationships between two pairs of subspaces, follows.

###### Definition 4 (measurement between subspace pairs).

Given two pairs of subspaces () and (), the measurement ((), ()) between () and () is defined as

(6) |

where and are subspaces in , =min{dim(), dim(), dim(), dim()} and is the singular value of matrix and represents cosine value of the principal angle between and .
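To make the idea of comparing two pairs of subspaces concrete, the following sketch compares the leading cosines of the principal angles of each pair. The exact expression of (6) is not reproduced here, so the aggregation used below (a sum of absolute differences between the leading singular values) is an assumption made for illustration only, and the function names are hypothetical.

```python
import numpy as np


def angle_cosines(A, B):
    """Singular values of A^T B: cosines of the principal angles.

    A and B must have orthonormal columns in a common ambient space.
    """
    return np.linalg.svd(A.T @ B, compute_uv=False)


def pair_measurement(A, B, C, D):
    """Compare the pair (A, B) with the pair (C, D) through principal angles.

    ASSUMPTION: differences between the r leading cosines are aggregated by
    a sum of absolute values, with r the smallest of the four dimensions.
    """
    r = min(A.shape[1], B.shape[1], C.shape[1], D.shape[1])
    s_ab = angle_cosines(A, B)[:r]
    s_cd = angle_cosines(C, D)[:r]
    return float(np.sum(np.abs(s_ab - s_cd)))


rng = np.random.default_rng(1)


def random_basis(n, k):
    """Orthonormal basis of a random k-dimensional subspace of R^n."""
    Q, _ = np.linalg.qr(rng.normal(size=(n, k)))
    return Q


A, B = random_basis(6, 3), random_basis(6, 2)
Q = random_basis(6, 6)  # a random orthogonal matrix

same_pair = pair_measurement(A, B, A, B)
rotated_pair = pair_measurement(A, B, Q @ A, Q @ B)
```

Rotating both subspaces of a pair by the same orthogonal matrix leaves the principal angles unchanged, so `rotated_pair` is zero up to round-off. This is the sense in which such a measurement looks only at the relationship inside each pair and not at where the pair sits in the ambient space.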

Measurement defined on is actually a metric, as proven in the following theorem.

###### Theorem 3.

() is a metric space, where .

###### Proof.

Let and are subspaces in . Thus, we need to prove following conditions.

1) ;

2) ;

3) ;

4) .

The definition of the consistency of the geometry relation with respect to feature spaces of two domains can be given in terms of the metric as follows.

###### Definition 5 (consistency of the geometry relation).

Given the source domain and the heterogeneous and unlabeled target domain , let and , if , such that

(7) |

then we can say and have consistent geometry relations, where , , , , and is a matrix of ones of the same size as .

This definition precisely demonstrates how and influence the geometric relation between the original feature spaces and the feature spaces of homogeneous representations. If there are slight changes in the original feature spaces, we hope the feature spaces of the homogeneous representations will also see slight changes. If they do, it means that the feature spaces of the homogeneous representations are consistent with the geometry relations of the two original feature spaces. If we use definitions of and , (7) can be expressed by

(8) |

To ensure the consistency of the geometric relation of the two original feature spaces, we can minimize following cost function to ensure we are able to find an that is less than , such that when there are slight changes in the original feature spaces.

###### Definition 6 (cost function I).

Given the source domain and the heterogeneous and unlabeled target domain , let and , the cost function of GLG is defined as

(9) |

where , , , and is a matrix of ones of the same size as .

This definition shows the divergence between the original feature spaces and the feature spaces of the homogeneous representations via principal angles. If we use to represent the principal angle of the original feature spaces and to represent the principal angle of the feature spaces of the homogeneous representations, measures the divergence of when the original feature spaces have slight changes. and are used to smooth and . is set to , and is set to . When , and are set to 0.
From Definition 6, it is obvious that the maps and will ensure all principal angles slightly change as approaches 0, even when there is some disturbance of up to . Thus, based on Theorem 2 and Definition 6, the GLG model is presented as follows.

Model (GLG).
The model GLG aims to find , to minimize the cost function , as defined in (6), while and are LMMs. GLG is expressed as

and are the new instances corresponding to and in the homogeneous representations with a dimension of . Knowledge is then transferred between and using GFK. Figure 1 illustrates GLG’s process.

Admittedly, LMMs are somewhat restrictive maps because all elements in U must be positive numbers. However, we use LMMs to prevent negative transfers that can significantly reduce prediction accuracy in the target domain. From the perspective of the entire transfer process, an LMM, as a positive map, is the only map that can help construct the homogeneous representations ( and ). The GFK model provides the second map, which does not have such rigid restrictions and makes and . Hence, the composite map (LMM+GFK) does not carry rigid restrictions and can therefore handle more complex problems. LMMs ensure correctness, thus avoiding negative transfer, and the GFK model (or another HoUDA model developed in future work) improves the ability to transfer knowledge. The following theorem demonstrates that GFK is a degenerate case of the GLG model.

###### Theorem 4 (degeneracy of GLG).

Given the source domain and the heterogeneous and unlabeled target domain , if two domains are homogeneous (), then the GLG model degenerates into the GFK model.

###### Proof.

Proving this theorem only requires proving that the optimized and in the GLG model are identical matrices when . In terms of Theorem 3, it is obvious that . So, if and , then we have (when ,), which results in the optimal GLG model. Because and , the GLG model degenerates into an ordinary GFK model. ∎

Since this optimization issue is related to the subspaces spanned by the original instances ( and ) and the subspaces spanned by the distributed instances ( and ), the best way to efficiently arrive at an optimized solution is a difficult and complex problem. Section IV proposes the optimization algorithm, focusing on the solution for GLG.

## IV Optimization of GLG

According to (6), we need to calculate 1) , and 2) the integration with respect to to minimize via a gradient descent algorithm, where , , and . Deriving and involves the process of spanning a feature space into a subspace. Thus, when there are some disturbances in an original feature space, the microscopic changes of the eigenvectors in an Eigen dynamic system (EDS) need to be analyzed (eigenvectors are used to construct the subspaces spanned by a feature space, i.e., and ). The following subsection discusses the microscopic analysis of an Eigen dynamic system.

### IV-A Microscopic analysis of an Eigen dynamic system

In this section, we explore the extent of the changes in the subspace when the feature space () suffers a disturbance, expressed as . Without loss of generality, assume (formed as an
