Rethink Maximum Mean Discrepancy for Domain Adaptation

07/01/2020 · by Wei Wang, et al. · Dalian University of Technology

Existing domain adaptation methods aim to reduce the distributional difference between the source and target domains while respecting their domain-specific discriminative information, by establishing the Maximum Mean Discrepancy (MMD) and discriminative distances. However, they usually simply accumulate these statistics and handle their relationships by estimating parameters blindly. This paper theoretically proves two essential facts: 1) minimizing the MMD is equivalent to maximizing the source and target intra-class distances respectively while jointly minimizing their variance with some implicit weights, so that the feature discriminability degrades; 2) the relationship between the intra-class and inter-class distances is such that as one falls, the other rises. Based on this, we propose a novel discriminative MMD. On the one hand, we consider the intra-class and inter-class distances alone, removing a redundant parameter, and the revealed weights provide their approximate optimal ranges. On the other hand, we design two different strategies to boost the feature discriminability: 1) we directly impose a trade-off parameter on the implicit intra-class distance in the MMD to regulate its change; 2) we impose the similar weights revealed in the MMD on the inter-class distance and maximize it, so that a balanced factor can be introduced to quantitatively leverage the relative importance between the feature transferability and its discriminability. Experiments on several benchmark datasets not only prove the validity of the theoretical results but also demonstrate that our approach substantially outperforms the compared state-of-the-art methods.


I Introduction

Thanks to the availability of substantial amounts of labeled data, traditional machine learning algorithms have achieved remarkable performance on object recognition tasks when the training and test data follow the same or similar distributions. In most realistic scenarios, though, only a fully-labeled source domain is available, from which we wish to learn a transferable classifier that correctly predicts the labels of a new target domain with a different distribution [ACM1, ACM2, ACM3]. Fortunately, Domain Adaptation (DA), an emerging technique, addresses this pressing challenge: it is committed to narrowing the distributional differences between the source and target domains so that the extensive knowledge in the source domain can be desirably transferred to the target, and the relabeling cost is thus largely mitigated [ACM4, ACM5, ACM6].

Fig. 1: Working principle of the MMD. Circles of different colors represent different categories, the tiny circles are the means of the corresponding categories, the hollow and meshed circles represent the source and target domains, the solid arrows denote the DA process with the MMD, and the comparatively larger circles are the transformed data features.

An essential issue in DA is to formulate an appropriate metric of distributional distance for measuring the proximity of two different distributions. Over the years, numerous distance metrics have been proposed. For example, the Quadratic [BD], Kullback-Leibler [KL], and Mahalanobis [RTML] distances, all derived from the Bregman divergence but generated by different convex functions, were introduced to explicitly match two different distributions. However, they are inflexible to extend to different DA models because of their inconvenient manipulation. Additionally, owing to theoretical deficiencies, they cannot describe more complicated distributions such as conditional and joint distributions. The Wasserstein distance from the optimal-transport problem exploits a transportation plan to align two different distributions [WD], but it is still tedious to apply to subspace-learning DA methods because it leads to a complex bi-level optimization problem that is nontrivial to solve.

Noticeably, the Maximum Mean Discrepancy (MMD) [MMD], a metric based on the embedding of distributions in a reproducing kernel Hilbert space, has been applied successfully to a wide range of problems thanks to its simplicity and solid theoretical foundation, such as transfer learning [TCA], kernel Bayesian inference [KBP], approximate Bayesian computation [ABC], two-sample testing [MMD], goodness-of-fit testing [GFT], MMD GANs [GMMN], and auto-encoders [IFVAE]. In the DA setting, the MMD aims to minimize the deviation between the means computed on the source and target domains. Together with the label constraint, the class-wise MMD [JDA] further diminishes the mean deviation of each pair of classes that share the same label but come from the two different domains. Moreover, the MMD and class-wise MMD are usually adopted to measure the marginal and conditional distribution differences between the two domains, respectively [JDA]. Notably, this paper only concentrates on rethinking the class-wise MMD, since the MMD is a special case of the class-wise MMD in which a whole domain is regarded as one category, and the DA performance is mainly determined by the class-wise MMD.

Early DA methods usually integrated the MMD directly into some predefined models [Dann1, DDC, TCA, SPDA, TSC], while some recent approaches studied the MMD in greater depth and elaborately improved it using prior data information [BDA, MEDA, WMMD, DAN, JAN, HoMM]. However, the adverse impacts of the MMD on some domain-specific data properties implicit in the original feature space, such as the feature discriminability and the local manifold structure of the data, are carelessly ignored. Differently, some remarkable algorithms devise various regularization losses, independent of the MMD, to offset the negative influences arising from the MMD [VDA, DICD, TIT, GEF, LPJT, TRSC, ARTL, SCA, JGSA]. In particular, some discriminative DA methods establish discriminative distances (i.e., intra-class and inter-class distances), as in Linear Discriminant Analysis (LDA) [HMLDA], to further respect the discriminative information of each specific domain. However, it is hard to study the relationships between those statistics qualitatively and quantitatively due to the lack of theory, so more parameters have to be estimated blindly and the learned DA models are often unstable. Strikingly, this paper mainly aims to answer the following two essential questions: 1) how exactly does the MMD minimize the mean deviations between the two different domains? 2) why does the MMD usually produce an unexpected deterioration of the feature discriminability?

As is well known, the DA technique, as a branch of transfer learning, also aims to simulate the transferable intelligence of human beings, so that the feature representations of images with the same semantic (i.e., category) become as similar as possible. As shown in Fig. 1, we expect to leverage the common features shared between the source and target domains by minimizing the mean deviation of each pair of categories, even though the two domains follow very different distributions; but how exactly does this work? Strikingly, this paper presents a novel insight into the MMD and theoretically reveals that its working principle reaches a high consensus with the transferable behavior of human beings. As shown in Fig. 1, given a pair of classes with the label Desktop_computer, respectively from the source and target domains, 1) the two relatively smaller red circles (i.e., the hollow and meshed ones) are transformed into the larger red ones, which are magnified greatly (i.e., maximizing their specific intra-class distances); 2) the two tiny red circles are gradually drawn closer to each other along their specific arrows (i.e., minimizing their joint variance). It is generally known that human beings attempt to abstract the common feature of a given semantic so that it embraces all its possible appearances, i.e., 1), but the detailed information decays heavily, i.e., 2). Therefore, the proposed novel insight is highly consistent with the transferability of human beings, and the detailed theoretical proof will be elaborated in Section II-C. Remarkably, this paper also theoretically proves that the discriminative distances involved in the MMD are distinctly different from those in LDA [HMLDA], since different weights are imposed on different classes. As shown in Fig. 1, the relatively smaller circles are enlarged and drawn closer to varying degrees.

Given the facts illustrated above, the reasons for the degradation of the feature discriminability in the MMD can therefore be revealed. As can be seen from Fig. 1, the common features of Desktop_computer and Trash_can (resp. Laptop_computer and Monitor) are quite similar, and the feature discriminability becomes worse than before since the circles of different categories chaotically overlap each other. This observation provides qualitative and quantitative guidance in studying the relationship between the MMD and the discriminative distances (i.e., transferability vs. discriminability) and in devising more robust and effective DA models. Qualitatively, this paper proposes a discriminative MMD with two different strategies to make the leveraged features more discriminative: 1) we directly impose a trade-off parameter on the implicit intra-class distance of the MMD to regulate its change; 2) we impose the similar weights revealed in the MMD on the inter-class distance and maximize it. Although two parameters are still involved, we no longer need to estimate them blindly in unknown regions, and it is easy to know how the feature properties will change with varying parameter values, since we experimentally observe that the implicit weights imposed on the discriminative distances revealed here are very close to the optimal ones empirically set in existing discriminative DA models (Section IV-C). This theoretically gives the approximate optimal parameter ranges, so that different feature properties can be regulated more exactly. Specifically, the trade-off parameter in the first strategy can be set within the revealed range, where the intra-class distance is regulated with different physical meanings (i.e., suppressing or removing its expansion, or enhancing its compactness instead). Moreover, the balanced factor in the second strategy can be set within a bounded interval to quantitatively leverage the relative importance between the feature transferability and its discriminability. Finally, we consider the intra-class and inter-class distances alone in the proposed two strategies, since this paper proves that their relationship is such that as one falls, the other rises; thus the redundant parameter can be omitted. By and large, the main contributions of our work are threefold:

  • This paper theoretically proves the working principle of the MMD and illustrates the reasons for the degradation of feature discriminability, which provides qualitative and quantitative guidance in studying their relationship and devising more robust and effective DA models.

  • This paper proposes a discriminative MMD with two different strategies, which provides approximate optimal parameter ranges and the corresponding physical meanings of different parameter values, so that different feature properties can be regulated more exactly.

  • This paper considers the intra-class and inter-class distances alone in the proposed two strategies, since we prove that their relationship is such that as one falls, the other rises; thus the redundant parameter can be further omitted.

II Rethink MMD

II-A Preliminary

In this paper, a matrix is denoted by a bold-italic uppercase letter (e.g., X) and a column vector by a bold-italic lowercase letter (e.g., x), where $x_i$ is the $i$-th column of $X$. In addition, $X_{ij}$ is the value in the $i$-th row and $j$-th column of $X$, and $x_i$ (non-bold) is the $i$-th value of $x$. $\|X\|_F$ is the Frobenius norm of $X$, and $\|x\|_2$ is the $\ell_2$-norm of $x$. The superscript $\top$ and the operator $\mathrm{tr}(\cdot)$ denote the transpose and trace, respectively. The source and target domains are indexed by the subscripts $s$ and $t$, and the $c$-th category is indexed by the superscript $c$.

In the DA scenario, we have a labeled source domain (i.e., data matrix $X_s \in \mathbb{R}^{d\times n_s}$ with label vector $y_s$) but an unlabeled target domain (i.e., $X_t \in \mathbb{R}^{d\times n_t}$), where $d$ is the feature dimension and $n_s$/$n_t$ is the number of source/target data instances; the whole data matrix is $X = [X_s, X_t] \in \mathbb{R}^{d\times n}$ ($n = n_s + n_t$). Notably, if not stated otherwise, $X$ represents the data matrix of any given domain and $y$ its label vector. Our goal is to jointly project the source and target domain data into a common feature subspace via a projection matrix $A \in \mathbb{R}^{d\times k}$, so that the distributional difference between the new feature representations (i.e., $A^\top X_s$ and $A^\top X_t$) is minimized substantially.

II-B Revisit MMD

To be specific, the marginal distribution difference between the two domains can be measured by the Maximum Mean Discrepancy (MMD), which computes the deviation between their means and can be formulated as follows:

$\big\|\frac{1}{n_s}\sum_{x_i\in X_s} A^\top x_i - \frac{1}{n_t}\sum_{x_j\in X_t} A^\top x_j\big\|_2^2 = \mathrm{tr}\big(A^\top X M_0 X^\top A\big),$   (1)

where $M_0$ is computed as follows:

$(M_0)_{ij} = \begin{cases} \frac{1}{n_s^2}, & x_i, x_j \in X_s, \\ \frac{1}{n_t^2}, & x_i, x_j \in X_t, \\ -\frac{1}{n_s n_t}, & \text{otherwise}. \end{cases}$   (2)

Together with the label constraint, the class-wise MMD is modeled to approximately measure the conditional distribution difference across the two domains; it further computes the mean deviation of each pair of classes with the same label but from the two different domains, and it can be defined as follows:

$\sum_{c=1}^{C}\big\|\frac{1}{n_s^c}\sum_{x_i\in X_s^c} A^\top x_i - \frac{1}{n_t^c}\sum_{x_j\in X_t^c} A^\top x_j\big\|_2^2 = \sum_{c=1}^{C}\mathrm{tr}\big(A^\top X M_c X^\top A\big),$   (3)

where $X_s^c$ (resp. $X_t^c$) contains the data samples pertaining to the $c$-th category of the source domain (resp. target domain), and their number is $n_s^c$ (resp. $n_t^c$). Likewise, $M_c$ is computed as follows:

$(M_c)_{ij} = \begin{cases} \frac{1}{(n_s^c)^2}, & x_i, x_j \in X_s^c, \\ \frac{1}{(n_t^c)^2}, & x_i, x_j \in X_t^c, \\ -\frac{1}{n_s^c n_t^c}, & x_i \in X_s^c, x_j \in X_t^c \ \text{or}\ x_i \in X_t^c, x_j \in X_s^c, \\ 0, & \text{otherwise}. \end{cases}$   (4)

Notably, the MMD is a special case of the class-wise MMD in which the whole source/target domain is regarded as one category. Once these two metrics of distributional distance are established, we can jointly reduce the marginal and conditional distribution differences between the source and target domains by minimizing the MMD and class-wise MMD losses in a given feature learning framework (e.g., Principal Component Analysis, PCA), which can be formulated as follows:

$\min_{A}\ \mathrm{tr}\Big(A^\top X \big(M_0 + \lambda\sum_{c=1}^{C} M_c\big) X^\top A\Big) + \eta\|A\|_F^2, \quad \text{s.t.}\ A^\top X H X^\top A = I,$   (5)

where the constraint means that the whole data variance is tied to a fixed value, so that the data information in the subspace can be statistically preserved to some extent; $\eta\|A\|_F^2$ controls the scale of the projection A, and $\lambda$ is the trade-off parameter. Notably, $I$ denotes an identity matrix of compatible size, $\mathbf{1} \in \mathbb{R}^{n\times n}$ is a matrix whose elements are all $1$, and $H = I - \frac{1}{n}\mathbf{1}$ is the centering matrix.
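
To make the construction of these matrices concrete, the following NumPy sketch (hypothetical helper names; it assumes integer class labels in {0, ..., C-1} and that target pseudo-labels are available) builds $M_0$ of Eq. 2 and the sum of the class-wise matrices $M_c$ of Eq. 4:

```python
import numpy as np

def mmd_matrix(n_s, n_t):
    """M_0 of Eq. 2: marginal MMD matrix for n_s source and n_t target samples."""
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    return np.outer(e, e)                       # (M_0)_ij = e_i * e_j

def classwise_mmd_matrix(y_s, y_t_pseudo, C):
    """Sum over c of M_c (Eq. 4); the target part relies on pseudo-labels."""
    n_s, n_t = len(y_s), len(y_t_pseudo)
    M = np.zeros((n_s + n_t, n_s + n_t))
    for c in range(C):
        src = np.where(y_s == c)[0]
        tgt = n_s + np.where(y_t_pseudo == c)[0]
        if len(src) == 0 or len(tgt) == 0:      # skip a class missing in either domain
            continue
        e = np.zeros(n_s + n_t)
        e[src], e[tgt] = 1.0 / len(src), -1.0 / len(tgt)
        M += np.outer(e, e)
    return M
```

With these matrices, the class-wise MMD of the projected features is simply $\mathrm{tr}(A^\top X M X^\top A)$, matching Eq. 3.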

II-C Rethink MMD

Although the MMD has been widely utilized in cross-domain problems, its detailed working principle has so far been insufficiently explored. Notably, this paper only focuses on the class-wise MMD because it plays a decisive role in the DA performance, and the MMD is a special case of the class-wise MMD.

This paper mainly answers the following two essential questions: 1) how exactly does the MMD minimize the mean deviations between the two different domains? 2) why does the MMD usually produce an unexpected deterioration of the feature discriminability? To this end, a novel insight is first provided theoretically, and we now present Lemma 1–Lemma 3 as follows.

Lemma 1. We have the following identity for the inter-class distance according to [HMLDA]:

$\mathrm{tr}(S_b) = \sum_{c=1}^{C} n^c \|m^c - m\|_2^2 = \frac{1}{2n}\sum_{c=1}^{C}\sum_{k=1}^{C} n^c n^k \|m^c - m^k\|_2^2,$   (6)

where $S_b = \sum_{c=1}^{C} n^c (m^c - m)(m^c - m)^\top$ is the inter-class scatter matrix. Notably, $m^c$/$m^k$ denotes the mean of the data instances from the $c$/$k$-th category, the number of such data instances is $n^c$, and $m$ is the mean of the whole data sample.
Lemma 2. The inter-class distance equals the data variance minus the intra-class distance:

$\mathrm{tr}(S_b) = \mathrm{tr}(S_v) - \mathrm{tr}(S_w),$   (7)

where $S_v = \sum_{i=1}^{n}(x_i - m)(x_i - m)^\top$ is the variance matrix, and $S_w = \sum_{c=1}^{C}\sum_{x_i\in X^c}(x_i - m^c)(x_i - m^c)^\top$ is the intra-class scatter matrix.

Proof. $S_v = \sum_{i=1}^{n}(x_i - m)(x_i - m)^\top = \sum_{c=1}^{C}\sum_{x_i\in X^c}\big((x_i - m^c) + (m^c - m)\big)\big((x_i - m^c) + (m^c - m)\big)^\top = S_w + S_b$, because the cross terms vanish, i.e., $\sum_{x_i\in X^c}(x_i - m^c) = 0$ for every $c$. Then $S_b = S_v - S_w$, and $\mathrm{tr}(S_b) = \mathrm{tr}(S_v) - \mathrm{tr}(S_w)$. This completes the proof.
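
As a quick numerical sanity check of Lemma 2 (a minimal sketch on synthetic data; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 60))                      # d = 3 features, n = 60 samples (columns)
y = rng.integers(0, 4, size=60)                   # 4 classes
m = X.mean(axis=1, keepdims=True)

S_v = (X - m) @ (X - m).T                         # variance matrix
S_w = np.zeros((3, 3))                            # intra-class scatter matrix
S_b = np.zeros((3, 3))                            # inter-class scatter matrix
for c in range(4):
    Xc = X[:, y == c]
    mc = Xc.mean(axis=1, keepdims=True)
    S_w += (Xc - mc) @ (Xc - mc).T
    S_b += Xc.shape[1] * (mc - m) @ (mc - m).T

assert np.allclose(S_b, S_v - S_w)                # Lemma 2: S_b = S_v - S_w
```
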
Lemma 3. We have the following identity for the MMD:

$\sum_{c=1}^{C}\Big\|\frac{1}{n_s^c}\sum_{x_i\in X_s^c} x_i - \frac{1}{n_t^c}\sum_{x_j\in X_t^c} x_j\Big\|_2^2 = \sum_{c=1}^{C}\frac{n^c}{n_s^c n_t^c}\Big(\sum_{x_i\in X_{st}^c}\|x_i - m_{st}^c\|_2^2 - \sum_{l\in\{s,t\}}\sum_{x_i\in X_l^c}\|x_i - m_l^c\|_2^2\Big),$   (8)

where $X_{st}^c = [X_s^c, X_t^c]$, $n^c = n_s^c + n_t^c$, and $l\in\{s,t\}$ indexes the two domains. Notably, $n_l^c$ is the number of data instances of the $c$-th category from the $l$-th domain, $m_l^c$ is the data mean of the $c$-th category from the $l$-th domain, and $m_{st}^c$ is the data mean of the $c$-th category from both the source and target domains.
Proof. Given the data instances of the $c$-th category respectively from the source and target domains, regard the two domains as two classes of $X_{st}^c$. Because $\sum_{l\in\{s,t\}} n_l^c\|m_l^c - m_{st}^c\|_2^2 = \frac{n_s^c n_t^c}{n^c}\|m_s^c - m_t^c\|_2^2$, and by Lemma 1 and Lemma 2, we have

$\|m_s^c - m_t^c\|_2^2 = \frac{n^c}{n_s^c n_t^c}\Big(\sum_{x_i\in X_{st}^c}\|x_i - m_{st}^c\|_2^2 - \sum_{l\in\{s,t\}}\sum_{x_i\in X_l^c}\|x_i - m_l^c\|_2^2\Big).$   (9)

Notably, Eq. 8 is the sum of Eq. 9 over all categories. This completes the proof.
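
The per-class identity of Eq. 9 can be verified in the same spirit (a minimal sketch; `Xs_c` and `Xt_c` stand for the class-c samples of the source and target domains):

```python
import numpy as np

rng = np.random.default_rng(1)
Xs_c = rng.normal(0.0, 1.0, size=(5, 20))         # class-c source samples (n_s^c = 20)
Xt_c = rng.normal(0.5, 1.2, size=(5, 30))         # class-c target samples (n_t^c = 30)
ns_c, nt_c = Xs_c.shape[1], Xt_c.shape[1]
n_c = ns_c + nt_c

mmd_c = np.sum((Xs_c.mean(axis=1) - Xt_c.mean(axis=1)) ** 2)        # ||m_s^c - m_t^c||^2

X_c = np.hstack([Xs_c, Xt_c])
joint_var = np.sum((X_c - X_c.mean(axis=1, keepdims=True)) ** 2)    # joint variance
intra = (np.sum((Xs_c - Xs_c.mean(axis=1, keepdims=True)) ** 2)
         + np.sum((Xt_c - Xt_c.mean(axis=1, keepdims=True)) ** 2))  # intra-class distances

# Eq. 9: MMD_c = (n^c / (n_s^c n_t^c)) * (joint variance - intra-class distance)
assert np.isclose(mmd_c, n_c / (ns_c * nt_c) * (joint_var - intra))
```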

From Eq. 9, it can be concluded that the MMD aims to maximize the source and target intra-class distances respectively while jointly minimizing their variance, with different weights that are separately established by each pair of classes with the same label but from the source and target domains. This conclusion illustrates the detailed working principle of the MMD, i.e., how it precisely minimizes the mean deviations. Moreover, since the intra-class distance is enlarged greatly while the whole data variance is tied to a fixed value (i.e., Eq. 5), the inter-class distance is drawn closer (i.e., Lemma 2). Thus, different classes overlap each other chaotically to various degrees and the feature discriminability degrades largely, which provides qualitative and quantitative guidance in studying the relationship between the MMD and the discriminative distances, and in devising more robust and effective DA models. Based on this, in the next section we propose a discriminative MMD with two different strategies to make the extracted features more discriminative.

III The Proposed Approach

According to Lemma 3, we can rewrite Eq. 5 using the equivalent formulation of the original MMD as follows:

$\min_{A}\ \mathrm{tr}\Big(A^\top X \big(M_0 + \lambda (L_v - L_w)\big) X^\top A\Big) + \eta\|A\|_F^2, \quad \text{s.t.}\ A^\top X H X^\top A = I,$   (10)

where $w^c = \frac{n^c}{n_s^c n_t^c}$, and $L_v$ and $L_w$ are the Laplacian matrices of the weight matrices $W_v$ and $W_w$, respectively. We let $X_{st}^c = [X_s^c, X_t^c]$, and define $W_v$ as follows:

$(W_v)_{ij} = \begin{cases} \frac{w^c}{n^c}, & x_i, x_j \in X_{st}^c, \\ 0, & \text{otherwise}. \end{cases}$   (11)

Moreover, we define $(D_v)_{ii} = \sum_{j}(W_v)_{ij}$ and $(D_w)_{ii} = \sum_{j}(W_w)_{ij}$, thus $L_v = D_v - W_v$ and $L_w = D_w - W_w$. For the convenience of matrix operation in Eq. 10, we utilize $X_s^c$ and $X_t^c$ to define $W_w^s$ and $W_w^t$ ($W_w = W_w^s + W_w^t$) as follows:

$(W_w^s)_{ij} = \begin{cases} \frac{w^c}{n_s^c}, & x_i, x_j \in X_s^c, \\ 0, & \text{otherwise}, \end{cases}$   (12)

$(W_w^t)_{ij} = \begin{cases} \frac{w^c}{n_t^c}, & x_i, x_j \in X_t^c, \\ 0, & \text{otherwise}. \end{cases}$   (13)
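
The weight matrices above can be assembled directly from the (pseudo-)labels. The sketch below (our own helper, following the definitions in Eq. 11–Eq. 13) returns $L_v$ and $L_w$; their difference coincides with $\sum_c M_c$ of Eq. 4, which is exactly the equivalence used in Eq. 10:

```python
import numpy as np

def weighted_laplacians(y_s, y_t_pseudo, C):
    """Build L_v (weighted joint-variance part) and L_w (weighted intra-class part)."""
    n_s, n_t = len(y_s), len(y_t_pseudo)
    n = n_s + n_t
    y = np.concatenate([y_s, y_t_pseudo])
    dom = np.concatenate([np.zeros(n_s, int), np.ones(n_t, int)])   # 0: source, 1: target
    W_v, W_w = np.zeros((n, n)), np.zeros((n, n))
    for c in range(C):
        ns_c = np.sum((y == c) & (dom == 0))
        nt_c = np.sum((y == c) & (dom == 1))
        if ns_c == 0 or nt_c == 0:
            continue
        n_c = ns_c + nt_c
        w_c = n_c / (ns_c * nt_c)                                   # weight revealed in the MMD
        both = np.where(y == c)[0]
        W_v[np.ix_(both, both)] = w_c / n_c                         # Eq. 11
        for d, n_lc in ((0, ns_c), (1, nt_c)):
            idl = np.where((y == c) & (dom == d))[0]
            W_w[np.ix_(idl, idl)] = w_c / n_lc                      # Eq. 12 / Eq. 13
    L_v = np.diag(W_v.sum(axis=1)) - W_v
    L_w = np.diag(W_w.sum(axis=1)) - W_w
    return L_v, L_w
```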

Now, we devise a discriminative MMD with two different strategies, which are formulated as Eq. 14 and Eq. 16, respectively.

III-A The First Strategy

$\min_{A}\ \mathrm{tr}\Big(A^\top X \big(M_0 + \lambda (L_v - (1-\beta) L_w)\big) X^\top A\Big) + \eta\|A\|_F^2, \quad \text{s.t.}\ A^\top X H X^\top A = I.$   (14)

Here, a trade-off parameter $\beta$ is directly imposed on the implicit intra-class distance of the MMD. Notably, the revealed weights provide theoretical guidance for setting the parameter imposed on the intra-class distance and for intensifying the feature discriminability more correctly, since the approximately optimal parameter region is revealed by the implicit weights and we know exactly how the leveraged feature properties change with varying $\beta$.

Specifically, there exist three cases regarding Eq. 14: 1) the expansion of the intra-class distance is gradually mitigated when $\beta < 1$; 2) the adverse influence on the intra-class distance is offset exactly when $\beta = 1$; 3) the intra-class compactness is positively stimulated, instead of being weakened, when $\beta > 1$.

Similar to previous work [JDA, BDA, VDA], Eq. 14 is equivalent to a generalized eigen-decomposition problem:

$\big(X (M_0 + \lambda(L_v - (1-\beta) L_w)) X^\top + \eta I\big) A = X H X^\top A \Phi,$   (15)

where $\Phi$ is a diagonal matrix of Lagrange multipliers. Eq. 15 can be solved effectively and efficiently by calculating the eigenvectors corresponding to the $k$ smallest eigenvalues.
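
A minimal solver sketch (assuming the helpers above; `scipy.linalg.eigh` handles the generalized symmetric eigenproblem, and $k$ is the subspace dimension):

```python
import numpy as np
import scipy.linalg

def solve_projection(X, M, H, eta, k):
    """Solve (X M X^T + eta*I) A = X H X^T A Phi for the k smallest eigenvalues (Eq. 15)."""
    d = X.shape[0]
    left = X @ M @ X.T + eta * np.eye(d)
    right = X @ H @ X.T
    left, right = (left + left.T) / 2, (right + right.T) / 2
    right += 1e-6 * np.eye(d)                    # slight regularization for numerical stability
    vals, vecs = scipy.linalg.eigh(left, right)  # eigenvalues returned in ascending order
    return vecs[:, :k]                           # eigenvectors of the k smallest eigenvalues
```

Here `M` stands for $M_0 + \lambda\big(L_v - (1-\beta)L_w\big)$ of Eq. 14, and `H` for the centering matrix of Eq. 5.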

III-B The Second Strategy

$\min_{A}\ \mathrm{tr}\Big(A^\top X \big((1-\mu)\,(M_0 + \lambda (L_v - L_w)) - \mu\,(L_b^s + L_b^t)\big) X^\top A\Big) + \eta\|A\|_F^2, \quad \text{s.t.}\ A^\top X H X^\top A = I,$   (16)

where $L_b^s = D_b^s - W_b^s$ and $L_b^t = D_b^t - W_b^t$ are the Laplacian matrices of the weighted inter-class graphs of the source and target domains, and $\mu$ is the balanced factor. The weighted inter-class distance of the $l$-th domain ($l\in\{s,t\}$), reformulated with the similar weights revealed in the MMD, could be computed as follows:

$\sum_{c=1}^{C} w^c\, n_l^c\, \|A^\top m_l^c - A^\top m_l\|_2^2 = \mathrm{tr}\big(A^\top X L_b^l X^\top A\big), \quad l\in\{s,t\},$   (17)

where $m_l$ is the mean of the $l$-th domain. Similar to $W_v$, $W_w$, $D_v$, and $D_w$, the matrices $W_b^s$, $W_b^t$, $D_b^s$, and $D_b^t$ can be constructed accordingly. Different from the first strategy, we aim to further reformulate the inter-class distances of the source and target domains using the similar weights revealed in the MMD, so that the optimal parameter region imposed on the inter-class distances is also known beforehand. Therefore, the balanced factor $\mu$ can be employed to adaptively leverage the relative importance of the feature transferability and its discriminability. Likewise, Eq. 16 can be solved in the same manner as Eq. 14.
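
For illustration, the combined matrix of the second strategy could be assembled as follows (a sketch under the assumption that the objective takes the convex-combination form written in Eq. 16; the inter-class Laplacians `L_b_s` and `L_b_t` are built analogously to $L_v$ and $L_w$):

```python
def second_strategy_matrix(M_0, L_v, L_w, L_b_s, L_b_t, lam, mu):
    """Balance transferability (equivalent MMD) against discriminability (inter-class)."""
    transfer = M_0 + lam * (L_v - L_w)           # equivalent MMD formulation of Eq. 10
    discrim = L_b_s + L_b_t                      # weighted source/target inter-class Laplacians
    return (1 - mu) * transfer - mu * discrim    # plug into the eigensolver of Eq. 15
```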

Moreover, from Lemma 3, we consider those two discriminative distances alone in the proposed two strategies, since their relationship is such that as one falls, the other rises when the whole data variance is fixed; thus the redundant parameter can be further omitted.

III-C Classification Scheme

We utilize the whole set of shared features to construct a nearest-neighbor similarity graph, similar to [CPCAN, GAKT]. Specifically, the weight $W_{ij}$ measures the similarity between $A^\top x_i$ and $A^\top x_j$: the closer they are, the bigger the weight. We then employ a Graph-based Label Propagation (GLP) method [GAKT] to propagate the source labels to the target domain data as follows:

$\min_{F}\ \mathrm{tr}\big(F^\top L_g F\big) + \mathrm{tr}\big((F - Y)^\top U (F - Y)\big),$   (18)

where $F, Y \in \mathbb{R}^{n\times C}$ are the predicted and given one-hot label matrices, and $U$ is a diagonal matrix whose entry $U_{ii}$ takes a large value if $x_i \in X_s$, so that the given source labels are preserved. Besides, we define the graph Laplacian matrix $L_g = B - W$, where $B$ denotes a diagonal matrix whose diagonal entries are the column sums of $W$, i.e., $B_{ii} = \sum_j W_{ji}$.

We first take the partial derivative of Eq. 18 w.r.t. $F$ and set it to zero. The solution can then be derived as follows:

$F = (L_g + U)^{-1} U Y.$   (19)

Once the one-hot label matrix $F$ is obtained, the target label of any given target instance $x_j$ is computed as $y_j = \arg\max_{c} F_{jc}$.
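
A compact sketch of this label-propagation step (assuming a symmetric connectivity-based nearest-neighbor graph and source labels clamped through large diagonal entries of $U$; these are common choices rather than necessarily the exact ones used here):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def graph_label_propagation(Z, y_s, C, n_neighbors=10, clamp=1e6):
    """Z: (k, n) projected features A^T X; the first len(y_s) columns are labeled source data."""
    n, n_s = Z.shape[1], len(y_s)
    W = kneighbors_graph(Z.T, n_neighbors=n_neighbors, mode='connectivity').toarray()
    W = np.maximum(W, W.T)                             # symmetric similarity graph
    L_g = np.diag(W.sum(axis=1)) - W                   # graph Laplacian L_g = B - W
    Y = np.zeros((n, C))
    Y[np.arange(n_s), y_s] = 1.0                       # one-hot source labels, target rows zero
    U = np.diag(np.concatenate([np.full(n_s, clamp), np.zeros(n - n_s)]))
    F = np.linalg.solve(L_g + U + 1e-6 * np.eye(n), U @ Y)   # Eq. 19: F = (L_g + U)^{-1} U Y
    return F[n_s:].argmax(axis=1)                      # pseudo-labels for the target data
```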

Since the statistics of the target domain are computed from its pseudo-labels in the current iteration, we have to optimize A and F iteratively. Remarkably, our approach achieves desirable performance within only a few iterations.
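
Putting the pieces together, the alternation between the projection and the target pseudo-labels could look like the schematic sketch below (it reuses the hypothetical helpers from the previous sketches; `T`, `lam`, `beta`, `eta`, and `k` are assumed hyper-parameters, and the initial pseudo-labels may come from, e.g., a source-only classifier):

```python
# Schematic alternation between the projection A and the target pseudo-labels.
y_t_pseudo = initial_pseudo_labels                 # e.g., predicted by a source-only classifier
for _ in range(T):                                 # a few iterations suffice in practice
    L_v, L_w = weighted_laplacians(y_s, y_t_pseudo, C)
    M = M_0 + lam * (L_v - (1 - beta) * L_w)       # first strategy, Eq. 14
    A = solve_projection(X, M, H, eta, k)
    Z = A.T @ X                                    # shared feature representations
    y_t_pseudo = graph_label_propagation(Z, y_s, C)
```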

Methods \ Tasks   C→A   C→W   C→D   A→C   A→W   A→D   W→C   W→A   W→D   D→C   D→A   D→W   Average
JDA [JDA] 45.62 41.69 45.22 39.36 37.97 39.49 31.17 32.78 89.17 31.52 33.09 89.49 46.38
BDA [BDA] 44.89 38.64 47.77 40.78 39.32 43.31 28.94 32.99 91.72 32.50 33.09 91.86 47.15
MEDA [MEDA] 56.50 53.90 50.30 43.90 53.20 45.90 34.00 42.70 88.50 34.90 41.20 87.50 52.70
ARTL [ARTL] 44.10 31.50 39.50 36.10 33.60 36.90 29.70 38.30 87.90 30.50 34.90 88.50 44.30
VDA [VDA] 46.14 46.10 51.59 42.21 51.19 48.41 27.60 26.10 89.18 31.26 37.68 90.85 49.03
SCA [SCA] 43.74 33.56 39.49 38.29 33.90 34.21 30.63 30.48 92.36 32.32 33.72 88.81 44.29
JGSA [JGSA] 51.46 45.42 45.86 41.50 45.76 47.13 33.21 39.87 90.45 29.92 38.00 91.86 50.04
DICD [DICD] 47.29 46.44 49.68 42.39 45.08 38.85 33.57 34.13 89.81 34.64 34.45 91.19 48.96
TIT [TIT] 59.70 51.50 48.40 47.50 45.40 47.10 34.90 40.20 87.90 36.70 42.10 84.80 52.20
GEF [GEF] 48.23 47.80 50.32 42.65 46.44 36.94 33.57 34.03 92.36 35.44 34.76 90.51 49.42
Our-I 60.44 54.92 54.78 46.04 53.90 44.59 34.82 42.28 93.63 36.69 45.30 95.25 55.22
Our-II 59.39 57.97 56.05 45.41 52.54 47.77 35.62 45.09 95.54 38.20 45.30 95.93 56.24
TABLE I: Average Classification Accuracy (%) of Office-10 vs. Caltech-10 with the SURF features
Methods \ Tasks   C→A   C→W   C→D   A→C   A→W   A→D   W→C   W→A   W→D   D→C   D→A   D→W   Average
ALEXNET [AlexNet] 91.90 83.70 87.10 83.00 79.50 87.40 73.00 83.80 100.0 79.00 87.10 97.70 86.10
DDC [DDC] 91.90 85.40 88.80 85.00 86.10 89.00 78.00 84.90 100.0 81.10 89.50 98.20 88.20
DAN [DAN] 92.00 90.60 89.30 84.10 91.80 91.70 81.20 92.10 100.0 80.30 90.00 98.50 90.10
JDA [JDA] 89.70 83.70 86.60 82.20 78.60 80.20 80.50 88.10 100.0 80.10 89.40 98.90 86.50
MEDA [MEDA] 93.40 95.60 91.10 87.40 88.10 88.10 93.20 99.40 99.40 87.50 93.20 97.60 92.80
ARTL [ARTL] 92.40 87.80 86.60 87.40 88.50 85.40 88.20 92.30 100.0 87.30 92.70 100.0 90.70
VDA [VDA] 92.17 82.71 87.26 86.20 80.68 81.53 87.80 91.75 100.0 88.60 92.90 99.66 89.27
SCA [SCA] 89.46 85.42 87.90 78.81 75.93 85.35 74.80 86.12 100.0 78.09 89.98 98.64 85.88
JGSA [JGSA] 91.44 86.78 93.63 84.86 81.02 88.54 84.95 90.71 100.0 86.20 91.96 99.66 89.98
DICD [DICD] 91.02 92.20 93.63 86.02 81.36 83.44 83.97 89.67 100.0 86.11 92.17 98.98 89.88
GEF [GEF] 91.34 88.81 91.08 83.97 78.64 85.99 83.88 89.25 100.0 86.29 92.28 98.98 89.21
Our-I 93.42 95.93 95.54 87.44 92.20 91.72 87.18 91.75 100.0 87.27 93.53 100.0 93.00
Our-II 93.42 95.93 96.82 88.42 92.88 91.72 88.87 92.17 100.0 88.87 93.63 100.0 93.56
TABLE II: Average Classification Accuracy (%) of Office-10 vs. Caltech-10 with the DeCAF6 features
Tasks \ Methods   I→P