Domain2Vec: Domain Embedding for Unsupervised Domain Adaptation

Conventional unsupervised domain adaptation (UDA) studies the knowledge transfer between a limited number of domains. This neglects the more practical scenario where data are distributed in numerous different domains in the real world. The domain similarity between those domains is critical for domain adaptation performance. To describe and learn relations between different domains, we propose a novel Domain2Vec model to provide vectorial representations of visual domains based on joint learning of feature disentanglement and Gram matrix. To evaluate the effectiveness of our Domain2Vec model, we create two large-scale cross-domain benchmarks. The first one is TinyDA, which contains 54 domains and about one million MNIST-style images. The second benchmark is DomainBank, which is collected from 56 existing vision datasets. We demonstrate that our embedding is capable of predicting domain similarities that match our intuition about visual relations between different domains. Extensive experiments are conducted to demonstrate the power of our new datasets in benchmarking state-of-the-art multi-source domain adaptation methods, as well as the advantage of our proposed model.


1 Introduction

Generalizing models learned on one visual domain to novel domains has been a major pursuit of machine learning in the quest for universal object recognition. The performance of the learned methods degrades significantly when tested on novel domains due to the presence of domain shift [domainshift].

Recently, Unsupervised Domain Adaptation (UDA) methods have been proposed to mitigate the domain gap. For example, several learning-based UDA models [JAN, tzeng2014deep, long2015] incorporate a Maximum Mean Discrepancy loss to minimize the domain discrepancy; other models propose different learning schemas to align the marginal feature distributions of the source and target domains, including aligning second-order correlations [sun2015return, peng2017synthetic], moment matching [zellinger2017central], GAN-based alignment [CycleGAN2017, hoffman2017cycada, UNIT], and adversarial domain confusion [adda, DANN, MCD_2018]. However, most current UDA methods consider domain adaptation between a limited number of domains (usually one source domain and one target domain). In addition, state-of-the-art UDA models mainly focus on aligning the feature distribution of the source domain with that of the target domain, and fail to consider the natural distances and relations between different domains. In the more practical scenario where multiple domains exist and the relations between them are unclear, it is critical to evaluate the natural domain distances between source and target, so that one or several domains can be selected from the source domain pool to achieve the best performance on the target domain.

Figure 1: Our Domain2Vec architecture achieves deep domain embedding by joint learning of feature disentanglement and the Gram matrix. We employ domain disentanglement (red lines) and class disentanglement (blue lines) to extract domain-specific and category-specific features, both trained adversarially. We further apply a mutual information minimizer to enhance the disentanglement.

In this paper, we introduce the Domain2Vec embedding to represent domains as elements of a vector space. Formally, given N distinct domains D̂ = {D_1, D_2, ..., D_N}, the aim is to learn a domain-to-vector mapping Φ: D → v. (Throughout this paper, calligraphic symbols denote Gram matrices and domains, and italic symbols denote the feature generator and disentangler, respectively.) We would like our Domain2Vec to hold the following properties: (i) given two domains D_i and D_j, the accuracy of a model trained on D_i and tested on D_j should be negatively correlated to their distance in the embedding space, i.e. a smaller domain distance leads to better cross-domain performance; (ii) the domain distance should match our intuition about visual relations; for example, two domains that both contain building images should lie closer to each other than to domains with unrelated content. Our domain embedding can be used to reason about the space of domains and to solve many unsupervised domain adaptation problems. As a motivating example, we study the problem of selecting the best combination of source domains when a novel target domain emerges.

Computation of the Domain2Vec embedding leverages a complementary term between the Gram matrix of deep representations and the disentangled domain-specific features. Gram matrices are commonly used to build style representations that compute the correlations between different filter activations in a deep network [gatys2015neural]. Since the activations of a deep network trained on a visual domain are a rich representation of the domain itself, we use the Gram matrix to capture the texture information of a domain and thereby obtain a stationary, multi-scale representation of the input domain. Specifically, given a domain D_i with N_i examples, we feed the data through a pre-trained reference convolutional neural network, which we call the feature generator G, and compute the activations of the fully connected layer as the latent representation f, as shown in Figure 1. Inspired by the feature disentanglement idea [DAL_DADA], we introduce a disentangler D to disentangle f into a domain-specific feature f_d and a category-specific feature f_c. Finally, we compute the Gram matrix of the activations of the hidden convolutional layers in the feature generator. Given a domain D_i, we average the domain-specific features of all its training examples as the prototype of D_i. We utilize the concatenation of the prototype and the diagonal entries of the average Gram matrix as the final embedding vector of domain D_i. We show that this embedding encodes the intrinsic properties of the domains (Section 4).

To evaluate our Domain2Vec model, a large-scale benchmark with multiple domains is required. However, existing cross-domain datasets contain only a limited number of domains: for example, the large-scale DomainNet [domainnet] contains six domains, and the Office-31 [office] benchmark has only three. In this paper, we create two large-scale datasets to facilitate research on multi-domain embedding. The TinyDA dataset is, to our knowledge, the largest MNIST-style cross-domain dataset to date. It contains 54 domains and about one million training examples. Following Ganin et al. [DANN], the images are generated by blending different foreground shapes over patches randomly cropped from background images. The second benchmark is DomainBank, which contains 56 domains sampled from existing popular computer vision datasets.

On the TinyDA dataset, we validate that the domain distance computed by our Domain2Vec model is negatively correlated with cross-domain performance. Then, we show the effectiveness of our Domain2Vec on multi-source domain adaptation. In addition, comprehensive experiments on the DomainBank benchmark under the openset and partial domain adaptation settings demonstrate that our model achieves significant improvements over state-of-the-art methods.

The main contributions of this paper are highlighted as follows: (i) we propose a novel learning paradigm of deep domain embedding and develop a Domain2Vec model to achieve the domain embedding; (ii) we collect two large-scale benchmarks to facilitate research in multiple domain embedding and adaptation; (iii) we conduct extensive experiments on various domain adaptation settings to demonstrate the effectiveness of our proposed model.

2 Related Work

Vectorial Representation Learning Discovery of effective representations that capture salient semantics for a given task is a fundamental goal of perceptual learning. The individual dimensions in a vectorial embedding have no inherent meaning; instead, it is the overall pattern of locations and distances between vectors that machine learning takes advantage of. GloVe [pennington2014glove] achieves global vectorial embeddings for words by training on the nonzero elements of a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. DeCAF [donahue2014decaf] investigates semi-supervised multi-task learning of deep convolutional representations, where representations are learned on a set of related problems but applied to new tasks that have too few training examples to learn a full deep representation. Modern state-of-the-art deep models [alexnet, vgg, resnet, resnext, huang2017densely] learn semantic representations with supervision and are applied to various vision and language processing tasks. Closely related to our work is the Task2Vec model [achille2019task2vec], which leverages the Fisher Information Matrix as the vectorial representation of different tasks. However, Task2Vec mainly considers the similarity between different tasks, whereas we focus on the same task and introduce a Domain2Vec framework to achieve deep domain embedding for multiple domains. A model named Domain2Vec was initially proposed by Deshmukh et al. [deshmukh2018domain2vec]; however, their model is designed for domain generalization, and ours is developed independently for a different purpose.

Unsupervised Domain Adaptation Deep neural networks have achieved remarkable success on diverse vision tasks [resnet, renNIPS15fasterrcnn, he2017mask], but at the expense of tedious labeling effort. Given a large-scale unlabeled dataset, it is expensive to annotate enough training data to train a deep model that generalizes well to it. Unsupervised Domain Adaptation [office, long2015, DANN, MCD_2018, domainnet, DAL_DADA, SE] provides an alternative by transferring knowledge from a different but related domain (source domain) to the domain of interest (target domain). Specifically, unsupervised domain adaptation (UDA) aims to transfer the knowledge learned from one or more labeled source domains to an unlabeled target domain. Various methods have been proposed, including discrepancy-based approaches [JAN, ddc, ghifary2014domain, peng2017synthetic], adversary-based approaches [cogan, adda, ufdn], and reconstruction-based approaches [yi2017dualgan, CycleGAN2017, hoffman2017cycada, kim2017learning]. These models are typically designed to tackle single-source to single-target adaptation. Compared with single-source adaptation, multi-source domain adaptation (MSDA) assumes that training data are collected from multiple sources. Originating from the theoretical analysis in [ben2010theory, Mansour_nips2018, crammer2008learning], MSDA has been applied to many practical applications [xu2018deep, duan2012exploiting, domainnet]. Specifically, Ben-David et al. [ben2010theory] introduce a divergence between the weighted combination of source domains and a target domain. Different from the previous work, we propose a Domain2Vec model to evaluate the natural distances between different domains.

Deep Feature Disentanglement Deep neural networks are known to extract features in which multiple hidden factors are highly entangled [zhuang2015supervised]. Learning disentangled representations can help to model the relevant factors of data variation, as well as to evaluate the relations between different domains by extracting domain-specific features. To this end, recent work [mathieu2016disentangling, makhzani2015adversarial, ufdn, cisac_gan] leverages generative adversarial networks (GANs) [gan] or variational autoencoders (VAEs) [vae] to learn interpretable representations. Under the multi-domain setting, Liu et al. [ufdn] propose a unified feature disentanglement framework to learn domain-invariant features from data across different domains. Odena et al. [cisac_gan] introduce an auxiliary classifier GAN (AC-GAN) to achieve representation disentanglement under a supervised setting. Recent work [drit, DAL_DADA] proposes to disentangle the features into a domain-invariant content space and a domain-specific attribute space, producing diverse outputs without paired training data. In this paper, we propose a cross-disentanglement schema to disentangle the deep features into domain-specific and category-specific features.

3 Domain2Vec

We define the domain vectorization task as follows: given N domains D̂ = {D_1, D_2, ..., D_N}, the aim is to learn a domain-to-vector mapping Φ: D → v that is capable of predicting domain similarities which match our intuition about visual relations between different domains. Our Domain2Vec includes two components: we first leverage feature disentanglement to generate the domain-specific features, and then we achieve deep domain embedding by jointly learning the Gram matrix of the latent representations and the domain-specific features.

3.1 Feature Disentanglement

Given an image-label pair (x, y), a deep neural network is a family of functions p_w(y|x), trained to approximate the posterior p(y|x) by minimizing the cross-entropy loss H(p̂, p_w) = E[-log p_w(y|x)], where p̂ is the empirical distribution defined by the i-th domain D_i with N_i training examples. It is beneficial, especially in the domain vectorization task, to think of the deep neural network as composed of two parts: a feature generator G, which computes the latent representation f = G(x) of the input data, and a classifier, which encodes the distribution p(y|f) given the representation f.

The latent representation f is highly entangled with multiple hidden factors. We propose to disentangle the hidden representation into domain-specific and category-specific features. Figure 1 shows the proposed model. Given N domains, the feature generator G maps the input data to a latent feature vector f, which contains both domain-specific and category-specific factors. The disentangler D is trained to disentangle f into a domain-specific feature f_d and a category-specific feature f_c with cross-entropy loss and adversarial training loss. The feature reconstructor R is responsible for recovering f from the (f_d, f_c) pair, aiming to keep the information integrity of the disentanglement process. To enhance the disentanglement, we follow Peng et al. [DAL_DADA] and apply a mutual information minimizer between f_d and f_c. A category classifier C is trained with class labels to predict the class distribution, and a domain classifier CD is trained with domain labels to predict the domain distribution. In addition, a cross-adversarial training step removes domain information from f_c and category information from f_d. We next describe each component in detail.
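To make the wiring concrete, below is a minimal PyTorch sketch of the components described above (feature generator G, disentangler D, category classifier C, domain classifier CD, and reconstructor R). The module structure and all dimensions are placeholders of ours, not the configuration used in the paper; Table 11 in the supplementary material lists the actual layer configuration.

```python
# Minimal sketch of the Figure 1 components; dimensions and module names are
# placeholders, not the paper's exact configuration (see Table 11).
import torch
import torch.nn as nn

class Disentangler(nn.Module):
    """Splits the latent feature f into a domain-specific part f_d and a
    category-specific part f_c."""
    def __init__(self, feat_dim=2048, out_dim=1024):
        super().__init__()
        self.domain_branch = nn.Sequential(nn.Linear(feat_dim, out_dim), nn.ReLU())
        self.class_branch = nn.Sequential(nn.Linear(feat_dim, out_dim), nn.ReLU())

    def forward(self, f):
        return self.domain_branch(f), self.class_branch(f)  # (f_d, f_c)

feat_dim, num_classes, num_domains = 2048, 10, 54
G = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim), nn.ReLU())  # feature generator (stand-in)
D = Disentangler(feat_dim)
C = nn.Linear(1024, num_classes)       # category classifier
CD = nn.Linear(1024, num_domains)      # domain classifier
R = nn.Linear(2 * 1024, feat_dim)      # reconstructor: concat(f_d, f_c) -> f_hat

x = torch.randn(8, 3, 32, 32)          # dummy batch
f = G(x)
f_d, f_c = D(f)
f_hat = R(torch.cat([f_d, f_c], dim=1))
class_logits, domain_logits = C(f_c), CD(f_d)
```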

Category Disentanglement Given an input image x, the feature generator G computes the latent representation f = G(x). Our category disentanglement is achieved by two-step adversarial training. First, we train the disentangler D and the K-way category classifier C to correctly predict the class labels, supervised by the cross-entropy loss:

\mathcal{L}_{ce}^{c} = -\mathbb{E}_{(x,y)\sim\hat{\mathcal{D}}} \sum_{k=1}^{K} \mathbb{1}[k=y] \log C(D(G(x)))_k    (1)

where K is the number of categories and y indicates the class label.

In the second step, we aim to remove the domain-specific information from f_c. Assume that we already have a well-trained domain classifier CD (which is easy to train with Equation 3); we freeze the parameters of the domain classifier and train the disentangler to generate f_c so as to fool the domain classifier. This can be achieved by minimizing the negative entropy of the predicted domain distribution:

\mathcal{L}_{ent}^{d} = \mathbb{E}_{x\sim\hat{\mathcal{D}}} \sum_{j=1}^{N} CD(f_c)_j \log CD(f_c)_j, \quad f_c = D(G(x))    (2)

This adversarial training process corresponds to the blue dotted lines in Figure 1. It forces the generated category-specific feature f_c to contain no domain-specific information.
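As a rough illustration, the two training steps for category disentanglement could be written as follows, reusing G, D, C, and CD from the sketch above; the optimizer handling here is ours and simplified.

```python
# Sketch of the two-step category disentanglement (Eqs. 1-2); reuses G, D, C, CD
# from the earlier sketch. The optimizer set-up is a simplification.
import torch
import torch.nn.functional as F

opt_cat = torch.optim.Adam(
    list(G.parameters()) + list(D.parameters()) + list(C.parameters()), lr=1e-4)

def category_step(x, y):
    """Step 1: train D and C to predict class labels from f_c (Eq. 1)."""
    loss = F.cross_entropy(C(D(G(x))[1]), y)          # D(...)[1] is f_c
    opt_cat.zero_grad(); loss.backward(); opt_cat.step()
    return loss.item()

def remove_domain_info_step(x):
    """Step 2: with the domain classifier CD frozen, train D so that the domain
    distribution predicted from f_c is near-uniform (negative entropy, Eq. 2)."""
    for p in CD.parameters():
        p.requires_grad_(False)
    p_dom = F.softmax(CD(D(G(x))[1]), dim=1)
    neg_entropy = (p_dom * torch.log(p_dom + 1e-8)).sum(dim=1).mean()
    opt_cat.zero_grad(); neg_entropy.backward(); opt_cat.step()
    for p in CD.parameters():
        p.requires_grad_(True)
    return neg_entropy.item()
```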

Domain Disentanglement To achieve deep domain embedding, disentangling category-specific features is not enough, as they fail to describe the relations between different domains. We therefore introduce domain disentanglement to extract domain-specific features from the latent representations. Previous adversarial-alignment based UDA models [adda, DAL_DADA] leverage a domain classifier to classify the input feature as source or target. However, such a domain classifier is binary and cannot be applied to our case directly. Similar to category disentanglement, our domain disentanglement is achieved by two-step adversarial training. We first train the feature generator G and disentangler D to extract the domain-specific feature f_d, supervised by domain labels and the cross-entropy loss:

\mathcal{L}_{ce}^{d} = -\mathbb{E}_{(x,d)\sim\hat{\mathcal{D}}} \sum_{j=1}^{N} \mathbb{1}[j=d] \log CD(D(G(x)))_j    (3)

where N is the number of domains and d denotes the domain label.

In the second step, we aim to remove the category-specific information from f_d. Assuming the category classifier C has been well trained during category disentanglement, we freeze its parameters and train the disentangler to generate f_d so as to fool the category classifier C. Similarly, we can minimize the negative entropy of the predicted class distribution:

\mathcal{L}_{ent}^{c} = \mathbb{E}_{x\sim\hat{\mathcal{D}}} \sum_{k=1}^{K} C(f_d)_k \log C(f_d)_k, \quad f_d = D(G(x))    (4)

This adversarial training process corresponds to the red dotted lines in Figure 1. If a well-trained category classifier is not able to predict the correct class labels from f_d, the category-specific information has been successfully removed from f_d.

Feature Reconstruction Previous literature [DAL_DADA] has shown that information can be lost in the feature disentanglement process, especially since the feature disentangler is composed of several fully connected and ReLU layers and cannot guarantee information integrity. We therefore introduce a feature reconstructor R to recover the original feature f from the disentangled domain-specific and category-specific features. The feature reconstructor takes two inputs and concatenates the (f_d, f_c) pair into a single vector in its first layer; this vector is then fed forward through several fully connected and ReLU layers. Denoting the reconstructed feature as f̂, we can train the feature reconstruction process with the following loss:

\mathcal{L}_{rec} = \| \hat{f} - f \|_2^2 + \mathrm{KL}( q(\hat{f}) \,\|\, p(f) )    (5)

where the first term aims at recovering the original features extracted by G, and the second term is the Kullback-Leibler divergence, which penalizes the deviation of the latent features from the prior distribution p(f).
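A hedged sketch of Eq. 5 follows, reusing the reconstructor R from the earlier sketch; fitting the reconstructed feature with a diagonal Gaussian over the batch and taking a standard-normal prior is our assumption, not necessarily the paper's exact formulation.

```python
# Sketch of the reconstruction objective (Eq. 5); the KL term assumes a
# standard-normal prior on the reconstructed feature (our assumption).
import torch
import torch.nn.functional as F

def reconstruction_loss(f, f_d, f_c):
    f_hat = R(torch.cat([f_d, f_c], dim=1))
    l2 = F.mse_loss(f_hat, f)                              # recover the original feature
    mu, var = f_hat.mean(dim=0), f_hat.var(dim=0) + 1e-8   # batch statistics of f_hat
    kl = 0.5 * (var + mu.pow(2) - 1.0 - torch.log(var)).sum()
    return l2 + kl
```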

Mutual Information Minimization The mutual information is a pivotal measure of the mutual dependence between two variables. To enhance the disentanglement, we minimize the mutual information between category-specific features and domain-specific features. Specifically, the mutual information is defined as:

I(f_d; f_c) = \int_{\mathcal{F}_d \times \mathcal{F}_c} \log \frac{d\mathbb{P}_{(f_d, f_c)}}{d\mathbb{P}_{f_d} \otimes d\mathbb{P}_{f_c}} \, d\mathbb{P}_{(f_d, f_c)}    (6)

where P_{(f_d, f_c)} is the joint probability distribution of (f_d, f_c), and P_{f_d} and P_{f_c} are the marginal distributions of f_d and f_c, respectively. Conventional mutual information is only tractable for discrete variables, or for a limited family of problems where the probability distributions are known [mine]. To address this issue, we follow [DAL_DADA] and adopt the Mutual Information Neural Estimator (MINE) [mine], which estimates the mutual information with a neural network T_θ: I(f_d; f_c) = sup_θ E_{P_{(f_d, f_c)}}[T_θ] − log(E_{P_{f_d} ⊗ P_{f_c}}[e^{T_θ}]). To avoid computing the integrals, we leverage Monte-Carlo integration to calculate the estimation:

I(f_d; f_c) = \frac{1}{n} \sum_{i=1}^{n} T_\theta(f_d^{(i)}, f_c^{(i)}) - \log\Big( \frac{1}{n} \sum_{i=1}^{n} e^{T_\theta(f_d^{(i)}, \bar{f}_c^{(i)})} \Big)    (7)

where (f_d^{(i)}, f_c^{(i)}) are sampled from the joint distribution, f̄_c^{(i)} is sampled from the marginal distribution P_{f_c}, n is the number of training examples, and T_θ is the neural network parameterized by θ that estimates the mutual information between f_d and f_c. We refer the reader to MINE [mine] for more details.
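A minimal MINE-style estimator for Eq. 7 is sketched below; marginal samples are drawn by shuffling f_c within the batch, and the network sizes are placeholders matching the earlier sketch rather than the paper's configuration.

```python
# Sketch of a MINE statistics network and the Monte-Carlo estimate of Eq. 7.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MINE(nn.Module):
    def __init__(self, dim=1024, hidden=512):
        super().__init__()
        self.fc_x = nn.Linear(dim, hidden)
        self.fc_y = nn.Linear(dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def T(self, x, y):                                       # statistics network T_theta
        return self.out(F.leaky_relu(self.fc_x(x) + self.fc_y(y)))

    def forward(self, f_d, f_c):
        joint = self.T(f_d, f_c).mean()
        f_c_marginal = f_c[torch.randperm(f_c.size(0))]      # shuffle to break the pairing
        marginal = torch.exp(self.T(f_d, f_c_marginal)).mean()
        return joint - torch.log(marginal + 1e-8)            # estimate of I(f_d; f_c)

mine = MINE(dim=1024)
mi = mine(torch.randn(32, 1024), torch.randn(32, 1024))      # maximized w.r.t. theta, minimized w.r.t. D
```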

3.2 Deep Domain Embedding

Our Domain2Vec model learns the domain-to-vector mapping by joint embedding of the Gram matrix and the domain-specific features. Specifically, given a domain D_i, we compute the disentangled domain-specific features f_d for all the training examples of D_i. The prototype of domain D_i is defined as the average of the domain-specific features of the examples in D_i. In addition, we compute the Gram matrix of the activations of the hidden convolutional layers in the feature generator G. The Gram matrix builds a style representation that computes the correlations between different filter responses. The feature correlations of layer l are given by the Gram matrix \mathcal{G}^l, whose entry \mathcal{G}^l_{ij} is the inner product between the vectorised feature maps i and j:

\mathcal{G}^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}    (8)

where F^l is the vectorised feature map of the l-th hidden convolutional layer. Since the full Gram matrix is unmanageably large for a deep feature extractor, we make an approximation by only leveraging the entries on the subdiagonal, main diagonal, and superdiagonal of the Gram matrix \mathcal{G}. We utilize the concatenation of the prototype and these diagonals of \mathcal{G} as the final embedding of domain D_i.
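The construction of the embedding vector can be sketched as follows (our own helper names; the choice of which convolutional layer to use is left open):

```python
# Sketch of the final domain embedding: prototype of f_d plus the sub-, main- and
# super-diagonals of the batch-averaged Gram matrix of a conv activation.
import torch

def gram_diagonals(act):
    """act: (B, C, H, W) activations of one hidden convolutional layer."""
    B, C, H, W = act.shape
    F_l = act.reshape(B, C, H * W)                # vectorised feature maps
    gram = torch.bmm(F_l, F_l.transpose(1, 2))    # (B, C, C); G_ij = <F_i, F_j>
    gram = gram.mean(dim=0)                       # average over the domain's examples
    return torch.cat([torch.diagonal(gram, offset=-1),
                      torch.diagonal(gram, offset=0),
                      torch.diagonal(gram, offset=1)])

def domain_embedding(f_d_batch, conv_act):
    prototype = f_d_batch.mean(dim=0)             # average domain-specific feature
    return torch.cat([prototype, gram_diagonals(conv_act)])

emb = domain_embedding(torch.randn(64, 1024), torch.randn(64, 128, 8, 8))  # dummy tensors
```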

Eliminating Sparsity The domain-specific features and the Gram matrix are highly sparse, which hampers the effectiveness of our Domain2Vec model. To address this issue, we apply dimensionality reduction. Empirically, we start by using PCA to reduce the dimensionality of the data to a specific length, and then leverage t-distributed Stochastic Neighbor Embedding [tsne] to reduce it to the desired dimensionality.
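With scikit-learn, this two-stage reduction might look like the following sketch; the intermediate and final dimensions are placeholders, not the values used in the paper.

```python
# Sketch of the PCA -> t-SNE reduction of the sparse domain embeddings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.random.rand(54, 3072)                      # one raw embedding per TinyDA domain
reduced = PCA(n_components=50).fit_transform(embeddings)   # first-stage reduction
final = TSNE(n_components=2, perplexity=10).fit_transform(reduced)
```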

Optimization Our model is trained in an end-to-end fashion. We train the feature generator G, the category and domain disentanglement components, MINE, and the reconstructor R iteratively with the Stochastic Gradient Descent [SGD] or Adam [Adam] optimizer. The overall optimization objective is:

\mathcal{L} = \mathcal{L}_{c} + \mathcal{L}_{d} + \alpha \mathcal{L}_{rec} + \beta I(f_d; f_c)    (9)

where α and β are hyper-parameters, and \mathcal{L}_{c} and \mathcal{L}_{d} denote the category disentanglement loss (Equations 1-2) and the domain disentanglement loss (Equations 3-4), respectively.
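Collapsing the iterative two-step updates into a single expression for brevity, one optimization step for Eq. 9 can be sketched as below, reusing the pieces defined earlier. The hyper-parameter values are placeholders, and in the actual training the MINE network is updated separately to maximize its estimate.

```python
# Condensed sketch of one optimization step for Eq. 9; reuses G, D, C, CD, R,
# reconstruction_loss and mine from the earlier sketches.
import itertools
import torch
import torch.nn.functional as F

alpha, beta = 0.1, 0.01          # placeholder hyper-parameters
optimizer = torch.optim.Adam(
    itertools.chain(G.parameters(), D.parameters(), C.parameters(),
                    CD.parameters(), R.parameters()), lr=1e-4)

def neg_entropy(logits):
    p = F.softmax(logits, dim=1)
    return (p * torch.log(p + 1e-8)).sum(dim=1).mean()

def train_step(x, y, d):
    f = G(x)
    f_d, f_c = D(f)
    loss_cat = F.cross_entropy(C(f_c), y) + neg_entropy(CD(f_c))    # L_c (Eqs. 1-2)
    loss_dom = F.cross_entropy(CD(f_d), d) + neg_entropy(C(f_d))    # L_d (Eqs. 3-4)
    loss = loss_cat + loss_dom + alpha * reconstruction_loss(f, f_d, f_c) \
           + beta * mine(f_d, f_c)                                   # Eq. 9
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```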

4 Experiments

We test Domain2Vec on the two large-scale datasets we created. Our experiments aim to test both the qualitative properties of the domain embedding and its performance on multi-source domain adaptation, openset domain adaptation, and partial domain adaptation. Our Domain2Vec is implemented in PyTorch. In the main paper, we only report the major results; detailed experimental settings and more implementation details are provided in the supplementary material.

4.1 Dataset

To evaluate the domain-to-vector mapping ability of our Domain2Vec model, a large-scale dataset with multiple domains is desired. Unfortunately, existing UDA benchmarks [office, officehome, domainnet, peng2017visda] only contain a limited number of domains and thus provide limited benchmarking ability for our Domain2Vec model. To address this issue, we collect two datasets for multiple domain embedding and adaptation, i.e., TinyDA and DomainBank.

TinyDA We create TinyDA, to our knowledge the largest MNIST-style cross-domain dataset to date. It contains 54 domains and about one million MNIST-style training examples. We generate TinyDA by blending different foreground shapes over patches randomly extracted from background images. This operation is formally defined for two images I^1, I^2 as I^{out}_{ijk} = |I^1_{ijk} - I^2_{ijk}|, where (i, j) are the coordinates of a pixel and k is the channel index. The foreground shapes are from MNIST [mnist], USPS [usps], EMNIST [emnist], KMNIST [kmnist], QMNIST [qmnist], and FashionMNIST [fashionmnist]. Specifically, MNIST, USPS, and QMNIST contain digit images; EMNIST includes images of MNIST-style English characters; KMNIST is composed of images of Japanese characters; and FashionMNIST contains MNIST-style images of fashion items. The background images are randomly cropped from the CIFAR10 [cifar10] or BSDS500 [bsds500] dataset. We apply three different post-processes to our rendered images: (1) replace the background with a black patch, (2) replace the background with a white patch, (3) convert the images to grayscale. These three post-processes, together with the original foreground images and the generated color images, form a dataset with five different modes, i.e. White Background (WB), Black Background (BB), GrayScale image (GS), Color image (Cr), and Original image (Or).
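A small sketch of this rendering step is given below, following the absolute-difference blend of Ganin et al. that the construction above references; the array shapes and random stand-in images are our own assumptions.

```python
# Sketch of the TinyDA blending operation: output pixel = |foreground - background|.
import numpy as np
from PIL import Image

def blend(foreground, background):
    """foreground: (H, W) grayscale shape; background: (H, W, 3) color patch."""
    fg = np.asarray(foreground, dtype=np.int16)[..., None]     # broadcast over channels
    bg = np.asarray(background, dtype=np.int16)
    return np.abs(fg - bg).astype(np.uint8)                    # I_out[i,j,k] = |I1[i,j] - I2[i,j,k]|

digit = np.random.randint(0, 256, (28, 28), dtype=np.uint8)     # stand-in for an MNIST digit
patch = np.random.randint(0, 256, (28, 28, 3), dtype=np.uint8)  # stand-in for a CIFAR10/BSDS500 crop
blended = Image.fromarray(blend(digit, patch))
```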

DomainBank In this dataset, each domain is defined by an existing dataset, since data from different genres or times typically have different underlying distributions. To evaluate our Domain2Vec model on state-of-the-art computer vision datasets, we collect a large-scale benchmark named DomainBank. The images of the DomainBank dataset are sampled from 56 existing popular computer vision datasets such as COCO [mscoco], CALTECH-256 [griffin2007caltech], PASCAL [pascal], VisDA [peng2017visda], DomainNet [domainnet], etc. We choose datasets with different image modalities, illuminations, camera perspectives, etc., to increase the diversity of the domains. In total, we collect 339,772 images with image-level and domain-level annotations. Different from TinyDA, the categories of different domains in DomainBank are not identical. This property makes DomainBank a good testbed for openset domain adaptation [busto2017openset, busto2018open] and partial domain adaptation [cao2018partial].

KMNIST foreground:
Target \ Source  BSDS-WB BSDS-BB BSDS-Or BSDS-Cr BSDS-GS  CIFAR-WB CIFAR-BB CIFAR-Cr CIFAR-GS
BSDS-WB          89.8    13.3    12.4    16.8    16.4     88.0     12.8     14.9     14.6
BSDS-BB          12.5    94.1    94.3    32.9    30.4     11.5     92.6     23.3     22.2
BSDS-Or           8.4    56.9    95.4    35.2    32.9      9.3     62.6     24.7     23.2
BSDS-Cr          73.4    68.6    89.8    84.2    69.1     66.2     66.5     70.9     56.5
BSDS-GS          72.7    64.0    87.9    67.4    74.1     68.7     66.7     55.1     59.4
CIFAR-WB         83.8    17.0    16.2    18.6    18.9     81.2     15.1     18.8     18.0
CIFAR-BB         13.1    90.0    91.2    26.0    24.1     11.8     88.8     18.8     17.9
CIFAR-Cr         66.5    65.8    85.3    81.4    68.8     61.6     65.7     76.1     65.7
CIFAR-GS         64.5    60.5    85.8    58.0    70.7     60.8     63.4     56.7     66.8

EMNIST foreground:
Target \ Source  BSDS-WB BSDS-BB BSDS-Or BSDS-Cr BSDS-GS  CIFAR-WB CIFAR-BB CIFAR-Cr CIFAR-GS
BSDS-WB          86.6     2.9     2.8     8.1     8.6     83.2      5.1      6.9      7.5
BSDS-BB           3.6    87.3    88.0    23.4    18.1      4.2     82.8     14.9     13.4
BSDS-Or          12.0    31.1    91.3    33.4    32.2     11.1     33.6     21.1     21.2
BSDS-Cr          59.1    47.0    85.8    80.0    60.8     47.9     42.0     60.0     42.7
BSDS-GS          59.4    46.7    82.5    56.1    65.9     52.2     46.8     41.2     44.6
CIFAR-WB         87.8    13.9     4.5    15.3    16.7     86.1     12.2     13.0     13.6
CIFAR-BB          2.1    85.4    87.1    18.1    17.1      1.9     82.7     12.0     12.5
CIFAR-Cr         58.2    48.9    83.5    76.1    59.6     48.4     44.7     67.8     55.0
CIFAR-GS         46.6    46.5    81.1    48.1    63.2     43.8     48.8     45.3     57.4

FashionMNIST foreground:
Target \ Source  BSDS-WB BSDS-BB BSDS-Or BSDS-Cr BSDS-GS  CIFAR-WB CIFAR-BB CIFAR-Cr CIFAR-GS
BSDS-WB          83.5    16.9    29.9    27.0    25.6     80.7     16.7     27.3     24.9
BSDS-BB          23.6    84.5    85.4    38.1    36.6     21.1     81.7     28.9     28.9
BSDS-Or          15.1    53.6    87.0    33.0    33.2     14.8     52.2     23.8     25.1
BSDS-Cr          75.6    68.6    85.2    81.6    74.4     69.9     54.7     75.6     71.3
BSDS-GS          72.3    66.3    83.5    71.5    77.6     70.2     61.9     69.5     73.2
CIFAR-WB         82.9    18.1    27.2    28.5    28.6     81.8     17.0     29.6     29.3
CIFAR-BB         21.1    84.8    86.2    29.1    28.4     18.1     82.3     22.1     23.3
CIFAR-Cr         75.1    67.9    85.1    82.2    75.6     72.4     62.4     78.6     76.6
CIFAR-GS         67.9    61.8    82.2    65.2    77.0     66.3     58.0     68.7     76.3
Table 1: Experimental results on TinyDA. The column-wise domains are the source domains; the row-wise domains are the target domains.
(a) t-SNE Plot (b) Domain Knowledge Graph (c) Deep Domain Embedding
Figure 2: Deep domain embedding results of our Domain2Vec model on TinyDA dataset: (a) t-SNE plot of the embedding result (color indicates different domain); (b) Domain knowledge graph. The size and color of the circles indicate the number of training examples and the degree of that domain, respectively. The width of the edge shows the domain distance between two domains. (c) The final deep domain embedding of our Domain2Vec model. (Best viewed in color. Zoom in to see details.)

4.2 Experiments on TinyDA

Domain Embedding Results We apply our Domain2Vec model to the TinyDA dataset to achieve deep domain embedding. The results are shown in Figure 2. Specifically, the domain knowledge graph shows the relations between different domains in a straightforward manner, and the nodes in the graph show the deep domain embedding. For each domain, we connect it with its five closest domains by edges weighted by their domain distance. The size and the color of the nodes are correlated with the number of training images in that domain and the degree of that domain, respectively. To validate that the domain distance computed with our Domain2Vec is negatively correlated with cross-domain performance, we conduct extensive experiments to calculate the cross-domain results on the TinyDA dataset, as shown in Table 1. We split the cross-domain results into three sub-tables for Japanese characters (KMNIST), English characters (EMNIST), and fashion items (FashionMNIST), respectively. In each sub-table, the column-wise domains are selected as the source domains and the row-wise domains are selected as the target domains.

From the experimental results, we make the following observations. (i) In each sub-table, the performance of training and testing on the same domain (gray background) is better than the cross-domain performance, except for a few outliers (pink background, mainly between MNIST, USPS, and QMNIST). (ii) The cross-domain performance is negatively correlated with the domain distance (illustrated in Figure 2(b)). We leverage the Pearson correlation coefficient (PCC) [benesty2009pearson] to quantitatively demonstrate this negative correlation. The PCC between two variables X and Y can be computed as ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y). We set the cross-domain performance and the domain distance as the two variables. The PCC that we compute for our case is -0.774, which demonstrates that our Domain2Vec successfully encodes the natural domain distance.
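For reference, the correlation check amounts to the following small computation; the arrays below are made-up numbers for illustration, while the value reported above is computed over the TinyDA source/target pairs.

```python
# Toy illustration of the Pearson correlation between cross-domain accuracy and
# domain distance; the arrays are made-up numbers, not the paper's data.
import numpy as np
from scipy.stats import pearsonr

accuracy = np.array([90.0, 75.0, 61.0, 44.0, 32.0])   # hypothetical source->target accuracies
distance = np.array([0.2, 0.5, 0.7, 1.1, 1.4])        # hypothetical Domain2Vec distances
r, p_value = pearsonr(accuracy, distance)              # r < 0 indicates negative correlation
```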

Standards Models MNIST→USPS MNIST→QMNIST USPS→MNIST USPS→QMNIST QMNIST→MNIST QMNIST→USPS Avg
Single Best Source Only 17.7±0.21 83.4±0.55 16.4±0.32 16.3±0.25 83.1±0.32 20.2±0.31 39.5±0.32
DAN [long2015] 21.4±0.27 87.1±0.64 19.7±0.37 19.9±0.34 85.7±0.34 21.8±0.37 42.6±0.39
RTN [RTN] 18.0±0.28 85.0±0.58 18.8±0.37 20.0±0.26 84.2±0.42 21.3±0.34 41.2±0.38
JAN [JAN] 21.7±0.27 87.6±0.64 19.4±0.42 18.0±0.29 87.2±0.36 25.1±0.33 43.2±0.39
DANN [DANN] 21.2±0.25 86.1±0.55 20.1±0.31 19.4±0.24 86.6±0.38 24.0±0.34 42.9±0.34
ADDA [adda] 20.3±0.31 88.1±0.63 18.3±0.46 21.4±0.38 88.5±0.39 25.9±0.43 43.8±0.43
SE [SE] 13.6±0.42 78.1±0.87 10.7±0.62 11.8±0.50 80.1±0.64 17.0±0.55 35.2±0.60
MCD [MCD_2018] 23.8±0.33 89.0±0.61 22.3±0.36 19.6±0.26 86.7±0.36 22.6±0.41 44.0±0.39
Source Combine Source Only 20.2±0.23 85.7±0.59 19.2±0.42 20.5±0.37 85.1±0.25 19.2±0.40 41.6±0.38
DAN [long2015] 19.8±0.30 85.4±0.64 22.4±0.43 21.9±0.49 88.0±0.33 19.2±0.48 42.8±0.45
RTN [RTN] 22.9±0.27 88.2±0.72 19.9±0.54 23.2±0.49 88.1±0.29 20.6±0.53 43.8±0.47
JAN [JAN] 21.8±0.29 88.1±0.59 22.2±0.50 23.9±0.45 89.5±0.36 22.3±0.46 44.6±0.44
DANN [DANN] 22.3±0.31 87.1±0.65 22.1±0.47 21.0±0.46 84.7±0.35 19.3±0.43 42.8±0.45
ADDA [adda] 25.2±0.24 87.9±0.61 20.5±0.46 22.0±0.36 88.1±0.25 20.7±0.49 44.1±0.40
SE [SE] 19.4±0.28 82.8±0.68 19.3±0.45 19.3±0.45 84.3±0.34 18.9±0.48 40.7±0.45
MCD [MCD_2018] 23.2±0.3 91.2±0.68 21.6±0.46 25.8±0.37 86.9±0.33 23.0±0.42 45.3±0.43
Multi-Source M3SDA [domainnet] 25.5±0.26 91.6±0.63 22.2±0.43 25.8±0.43 90.7±0.30 24.8±0.41 46.8±0.41
DCTN [xu2018deep] 25.5±0.28 93.1±0.7 22.9±0.41 29.5±0.47 91.2±0.29 26.5±0.48 48.1±0.44
Domain2Vec (moment matching) 27.8±0.27 94.3±0.64 24.3±0.52 27.1±0.39 89.2±0.26 28.1±0.41 48.5±0.42
Domain2Vec (adversarial) 28.2±0.31 94.5±0.63 27.6±0.41 29.3±0.39 91.5±0.26 27.2±0.42 49.7±0.40
Table 2: MSDA results on the TinyDA dataset. Our moment-matching and adversarial variants of Domain2Vec achieve 48.5% and 49.7% average accuracy, respectively, outperforming the baselines.

Multi-Source Domain Adaptation on TinyDA Our TinyDA dataset contains 54 domains. In our experiments, we consider MSDA between the digit datasets, i.e. MNIST, USPS, and QMNIST, resulting in six MSDA settings. We choose the "grayscale" (GS) domain with CIFAR10 background as the target domain. For the source domains, we remove the two "grayscale" domains and leverage the remaining seven domains as the source domains.

State-of-the-art multi-source domain adaptation algorithms tackle the MSDA task by adversarial alignment [xu2018deep] or by matching the moments of the domains [domainnet]. However, these models neglect the effect of domain distance. We incorporate our Domain2Vec model into the previous work [xu2018deep, domainnet] and devise two variants. The moment-matching variant of Domain2Vec borrows the moment matching idea [domainnet], with the training loss weighted by the domain distances computed by our model. The adversarial variant of Domain2Vec is inspired by adversarial learning [xu2018deep] and likewise applies the weights computed by our model. Inspired by Xu et al. [xu2018deep], we compare MSDA results with two other evaluation standards: (i) single best, reporting the single best-performing source-transfer result on the test set, and (ii) source combine, combining the source domains into a single domain and performing traditional single-source single-target adaptation. The high-level motivations of these two evaluation standards are as follows: the first evaluates whether MSDA can boost the best single-source UDA result; the second tests whether MSDA can outperform the trivial baseline that combines the multiple source domains into a single domain.
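A sketch of how the computed domain distances could weight the per-source training losses is given below; the softmax weighting and temperature are our assumptions, not the paper's exact formulation.

```python
# Sketch of distance-based source weighting for MSDA: closer sources get larger weights.
import torch

def source_weights(target_emb, source_embs, temperature=1.0):
    dists = torch.norm(source_embs - target_emb, dim=1)    # distance to each source domain
    return torch.softmax(-dists / temperature, dim=0)      # smaller distance -> larger weight

def weighted_msda_loss(per_source_losses, weights):
    return (weights * torch.stack(per_source_losses)).sum()

w = source_weights(torch.randn(128), torch.randn(7, 128))  # e.g. seven source domains
```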

For both the single best and source combine experiment settings, we take the following methods as our baselines: Deep Adaptation Network (DAN) [long2015], Joint Adaptation Network (JAN) [JAN], Domain Adversarial Neural Network (DANN) [DANN], Residual Transfer Network (RTN) [RTN], Adversarial Discriminative Domain Adaptation (ADDA) [adda], Maximum Classifier Discrepancy (MCD) [MCD_2018], and Self-Ensembling (SE) [SE]. For multi-source domain adaptation, we take Deep Cocktail Network (DCTN) [xu2018deep] and Moment Matching for Multi-Source Domain Adaptation (M3SDA) [domainnet] as our baselines.

The experimental results are shown in Table 2. The moment-matching and adversarial variants of Domain2Vec achieve 48.5% and 49.7% average accuracy, respectively, outperforming the other baselines. The results demonstrate that our models outperform the single best UDA results and the source-combine results, and can boost the multi-source baselines. We argue that the performance improvement is due to the good domain embedding of our Domain2Vec model.

4.3 Experiments on DomainBank

Domain Embedding Results Similar to the experiments on the TinyDA dataset, we apply our Domain2Vec model to the DomainBank dataset. The results are shown in Figure 3. Since our DomainBank dataset is collected from multiple existing computer vision datasets, the categories of different domains in DomainBank are not identical, so it is not feasible to compute the cross-domain performance directly as for TinyDA. However, we can still make the following interesting observations: (i) Domains with similar content tend to form a cluster. For example, the domains containing buildings are close to each other in terms of the domain distance, and the domains containing faces share the same property. (ii) The domains which contain artistic images are scattered on the exterior side of the embedding and are distinct from the domains which contain images in the wild. For example, the "cartoon", "syn", "quickdraw", "sketch", and "logo" domains are distributed on the exterior side of the embedding space. These observations demonstrate that our Domain2Vec model is capable of encoding the natural domain distance.

(a) t-SNE plot (b) Domain Knowledge Graph (c) Deep Domain Embedding
Figure 3: Domain embedding results of our Domain2Vec model on DomainBank  dataset.

 

Target VisDA Ytb BBox PASCAL COCO Average
Source Only 53.4 67.2±0.4 74.8±0.4 80.4±0.3 68.9
Openset SVM [jain2014multi] 53.9±0.5 68.6±0.4 77.7±0.4 82.1±0.4 70.6
AutoDIAL 54.2±0.5 68.1±0.5 75.9±0.4 83.4±0.4 70.4
AODA [aoda2018saito] 56.4±0.5 69.7±0.4 76.7±0.4 82.3±0.4 71.3
Domain2Vec 56.6±0.4 70.6±0.4 81.3±0.4 86.8±0.4 73.8

 

Table 3: Openset domain adaptation on the DomainBank dataset.

Openset Domain Adaptation on DomainBank Openset domain adaptation (ODA) considers classification when the target domain contains categories that are unknown (unlabelled) in the source domain. Our DomainBank dataset provides a good testbed for openset domain adaptation as the categories of different domains are not identical. Since DomainBank contains 56 domains, it is infeasible to explore all the (source, target) domain combinations. Instead, we demonstrate our model on the following four transfer settings: DomainNet [domainnet]→VisDA [peng2017visda], DomainNet→Youtube BBox [real2017youtube], DomainNet→PASCAL [pascal], and DomainNet→COCO. Specifically, DomainNet [domainnet] contains images of six distinct modalities and is well suited to serve as the source domain for our openset domain adaptation experiments.

The experimental results are shown in Table 3. Our model achieves 73.8% average accuracy, outperforming the compared baselines.

Partial Domain Adaptation on DomainBank In partial domain adaptation, the source domain label space is a superset of the target domain label space. Consistent with the openset domain adaptation experiments, we consider the following four partial domain adaptation settings: DomainNet [domainnet]→VisDA [peng2017visda], DomainNet→Youtube BBox [real2017youtube], DomainNet→PASCAL [pascal], and DomainNet→COCO.

The experimental results are shown in Table 4. Our model achieves 65.5% average accuracy, outperforming the compared baselines, which demonstrates that our model can boost performance in the partial domain adaptation setting. Specifically, our model utilizes the idea of PADA [cao2018partial], which trains a partial adversarial alignment network to tackle the partial domain adaptation task. We compute the domain distance between the sub-domains in the source training data (DomainNet) and apply the domain distance as a weight in the partial adversarial alignment process.

4.4 Ablation Study

Our model is composed of multiple components. To demonstrate the effectiveness of each component, we perform an ablation study. Table 5 shows the ablation results on the TinyDA dataset. We observe that the performance drops in most of the experiments when mutual information minimization and the Gram matrix information are not applied. The experimental results demonstrate the effectiveness of both components.

 

Target VisDA Ytb BBox PASCAL COCO Average
Source Only 34.5 74.3±0.4 68.2±0.3 76.4±0.2 63.3
AdaBN 35.1±0.5 75.6±0.5 68.2±0.4 78.1±0.4 64.2
AutoDIAL [cariucci2017autodial] 35.2±0.6 74.0±0.4 68.5±0.4 77.6±0.4 63.8
PADA [cao2018partial] 34.2±0.6 76.8±0.4 69.7±0.3 77.7±0.4 64.6
Domain2Vec 36.6±0.5 76.8±0.4 70.0±0.3 78.8±0.4 65.5

 

Table 4: Partial domain adaptation on the DomainBank dataset.
target MNIST→USPS MNIST→QMNIST USPS→MNIST USPS→QMNIST Avg
D2V 28.2±0.31 94.5±0.63 27.6±0.41 29.3±0.39 44.9
D2V w/o. Gram 28.5±0.29 92.4±0.56 25.5±0.29 27.7±0.26 43.5
D2V w/o. Mutual 26.7±0.27 94.1±0.49 27.9±0.35 27.4±0.41 44.0
target (Openset) VisDA Ytb BBox PASCAL COCO Avg (Partial) VisDA Ytb BBox PASCAL COCO Avg
D2V 56.6 70.6 81.3 86.8 73.8 36.6 76.8 70.0 78.8 65.5
D2V w/o. Gram 54.5 68.4 80.5 85.4 72.2 34.5 77.1 65.4 77.9 63.7
D2V w/o. Mutual 55.2 69.3 81.4 85.7 72.9 35.4 73.5 67.8 77.5 63.5
Table 5: The ablation study results show that mutual information minimization and the Gram matrix information are essential to our model. The upper table shows ablation experiments performed on the TinyDA dataset. The lower table shows ablation experiments on the DomainBank dataset (openset DA on the left, partial DA on the right).

5 Conclusion

In this paper, we have proposed a novel learning paradigm to explore the natural relations between different domains. We introduced the deep domain embedding task and proposed Domain2Vec to achieve domain-to-vector mapping with joint learning of the Gram matrix of the latent representations and feature disentanglement. We have collected and evaluated two large-scale domain adaptation datasets, TinyDA and DomainBank. These two datasets are challenging due to the presence of notable domain gaps and a large number of domains. Extensive experiments have been conducted, both qualitatively and quantitatively, on the two benchmarks we collected to demonstrate the effectiveness of our proposed model. We also show that our model can facilitate multi-source domain adaptation, openset domain adaptation, and partial domain adaptation. We hope the learning schema we proposed and the benchmarks we collected will be beneficial for future domain adaptation research.

Acknowledgements

We thank the anonymous reviewers for their comments and suggestions. This work was partially supported by NSF and Honda Research Institute.

References

6 Supplementary Material

The appendix is organized as follows: Section A shows the comparison of our two datasets with the state-of-the-art cross-domain datasets. Section B describes the details of generating the TinyDA dataset. Section C shows the detailed information about DomainBank dataset. Section D introduces the detailed network framework for experiments on TinyDA dataset. Section E shows the additional experimental analysis. Section F shows the category information in the openset domain adaptation experiments in Section 4.3.

A Comparison to modern datasets

Dataset Year Images Classes Domains Description
Digit-Five - 100,000 10 5 digit
Office [office] 2010 4,110 31 3 office
Office-Caltech [gong2012geodesic] 2012 2,533 10 4 office
CAD-Pascal [peng2015learning] 2015 12,000 20 6 animal,vehicle
Office-Home [officehome] 2017 15,500 65 4 office, home
PACS [PACS] 2017 9,991 7 4 animal, stuff
Open MIC [openmic] 2018 16,156 - - museum
Syn2Real [syn2real] 2018 280,157 12 3 animal,vehicle
DomainNet [domainnet] 2019 569,010 345 6 clipart,sketch
TinyDA (Ours) - 965,619 10 or 26 54 tiny images
DomainBank (Ours) - 339,772 - 55 dataset
Table 6: A collection of the most notable datasets used to evaluate domain adaptation methods. Specifically, the "Digit-Five" dataset refers to the five most popular digit datasets (MNIST [lecun1998gradient], MNIST-M [DANN], Synthetic Digits [DANN], SVHN, and USPS), which are widely used to evaluate domain adaptation models. Our datasets are challenging as they contain more images and domains than the other datasets.

B TinyDA  Generation

The images of the TinyDA dataset are generated by blending different foreground shapes over patches randomly extracted from background images. In the first step, we select a foreground shape from the following six MNIST-style datasets: MNIST [mnist], USPS [usps], EMNIST [emnist], KMNIST [kmnist], QMNIST [qmnist], and FashionMNIST [fashionmnist]. Secondly, we choose a background pattern from the CIFAR10 [cifar10] dataset or randomly cropped from the BSDS500 [bsds500] dataset. Thirdly, we apply three different post-processes to our rendered images: (1) replace the background with a black patch, (2) replace the background with a white patch, (3) convert the generated images to grayscale. These three post-processes, together with the original foreground images and the generated color images, form a dataset with five different modes, i.e. Black Background (BB), White Background (WB), GrayScale image (GS), Color image (Cr), and Original image (Or). In total, we generate a dataset with 54 domains and about one million MNIST-style training examples.

Image examples from our TinyDA dataset are shown in Table 7. Specifically, the upper and lower tables show the images generated with backgrounds from BSDS500 [bsds500] and CIFAR10 [cifar10], respectively. The number of images in each domain of the TinyDA dataset is given in Table 8.

Figure 4: Generation configuration for TinyDA  dataset. We create our TinyDA dataset with six foregrounds, two backgrounds and five modes. The foreground images are from MNIST [mnist], USPS [usps], EMNIST [emnist], KMNIST [kmnist], QMNIST [qmnist], FashionMNIST [fashionmnist]. The background images are randomly sampled from CIFAR10 [cifar10] or randomly cropped from BSDS500 [bsds500] dataset. The five modes include “Black Background”, “White Background”, “Color”, “GrayScale”, and “Original”.


FG/Mode Black BG White BG Color Grayscale Original

 

MNIST
USPS
FashionMNIST
KMNIST
QMNIST
EMNIST

 

MNIST
USPS
FashionMNIST
KMNIST
QMNIST
EMNIST
Table 7: Illustration of TinyDA dataset. We create our TinyDA dataset with six foregrounds, two backgrounds, and five modes. The upper and below table show the images generated with backgrounds from BSDS500 [bsds500] and CIFAR10 [cifar10], respectively.
FG/Mode Black BG White BG Color Grayscale Original

 

MNIST 40,000 40,000 40,000 40,000 20,000
USPS 14,582 14,582 14,582 14,582 7291
FashionMNIST 40,000 40,000 40,000 40,000 20,000
KMNIST 40,000 40,000 40,000 40,000 20,000
QMNIST 40,000 40,000 40,000 40,000 20,000
EMNIST 40,000 40,000 40,000 40,000 20,000
Table 8: Number of images in each domain of TinyDA dataset.
ID Image Samples ID Image Samples

 

1 2
3 4
5 6
7 8
9 10
11 12
13 14
15 16
17 18
19 20
21 22
23 24
25 26
27 28
29 30
31 32
33 34
35 36
37 38
39 40
41 42
43 44
45 46
47 48
49 50
51 52
53 54
55 56
Table 9: Illustration of the DomainBank dataset. The IDs in this table correspond to the IDs in Table 10.

C DomainBank  Dataset

The images of the DomainBank dataset are sampled from 56 existing popular computer vision datasets. We choose datasets with different image modalities, illuminations, camera perspectives, etc., to increase the diversity of the domains. More details about our DomainBank benchmark are shown in Table 9 and Table 10. In total, we collect 339,772 images with image-level and domain-level annotations. Different from TinyDA, the categories of different domains in DomainBank are not identical. This property makes DomainBank a good testbed for openset domain adaptation and partial domain adaptation.

ID Dataset Name Image# Description ID Dataset Name Image# Description
1 CUFSF  [ucfsf] 1,194 face_sketch 2 COCO  [mscoco] 10,000 real
3 PASCAL  [pascal] 10,000 real 4 DomainNet  [domainnet] 10,000 real
5 SYNTHIA  [synthia] 10,000 street_syn 6 UIUC CAR  [UIUC_CAR] 1,220 car
7 ZuBuD  [ZuBuD] 210 building 8 Bark-101  [bark] 2,586 bark
9 DomainNet  [domainnet] 10,000 sketch 10 Open-MIC  [openmic] 10,000 indoor
11 DomainNet  [domainnet] 10,000 clipart 12 Caltech256  [griffin2007caltech] 10,000 real
13 Ped. Detection  [enzweiler2008monocular] 10,000 pedestrian 14 Traffic Sign  [mathias2013traffic] 4,053 traffic
15 UKBench  [nister2006scalable] 10,000 indoor_stuff 16 Oxford Flower  [nilsback2006visual] 8,189 flower
17 Caltech Games  [aly2009towards] 7,660 game_cover 18 Oxford Buildings  [philbin2007object] 5,063 building
19 GFW Face  [he2018merge] 3,236 face 20 Driving  [udacity] 9,420 road
21 MegaAge  [huang2016unsupervised] 10,000 face 22 ADE20K  [zhou2019semantic] 10,000 indoor
23 Ped. Color  [cheng2016pedestrian] 10,000 pedestrian 24 LabelMeFacade  [labelmefacade] 395 building
25 UT Zappos50K  [finegrained] 10,000 shoes 26 TRANCOS  [TRANCOSdataset_IbPRIA2015] 1,244 traffic
27 FGVC  [TRANCOSdataset_IbPRIA2015] 10,000 aeroplane 28 Mall Dataset  [change2013semi] 2,000 mall
29 Chars74K  [de2009character] 7,705 character 30 DomainNet  [domainnet] 10,000 painting
31 Paris Dataset  [philbin2008lost] 3,187 street 32 DomainNet  [domainnet] 10,000 infograph
33 DroneDataset  [dronedataset] 400 drone 34 Boxy  [boxy2019] 2,148 road
35 Stanford Car  [stanfordcar] 8,144 car 36 DeepFashion2  [DeepFashion2] 10,000 fashion
37 ExDark  [Exdark] 6,619 dark 38 LaMem  [ICCV15_Khosla] 10,000 memorial
39 Stanford Dog  [KhoslaYaoJayadevaprakashFeiFei_FGVC2011] 10,000 dog 40 Cartoon Set  [royer2020xgan] 9,999 cartoon
41 DomainNet  [domainnet] 10,000 quick_draw 42 Football  [kazemi2012using] 771 football
43 Sketch Objects  [eitz2012hdhso] 10,000 sketch 44 CUB200  [welinder2010caltech] 10,000 bird
45 CITY-OSM  [kaiser2017learning] 914 drone_view 46 Arch Style  [xu2014architectural] 4,630 building
47 UCM Land  [yang2010bag] 2,100 satellite 48 Privacy Attribute  [orekondy17iccv] 4,157 stuff
49 IMDB-WIKI  [Rothe-IJCV-2018] 10,000 face 50 Street View  [6710175] 6,594 street
51 PPSS  [luo2013pedestrian] 1,458 pedestrian 52 Sketch Retrieval  [eitz2011sbir] 1,213 sketch
53 VisDA  [peng2017visda] 10,000 syn 54 GTA  [Richter_2016_ECCV] 5,000 syn
55 Youtube BBox  [real2017youtube] 10,000 real 56 Logo-2k+  [wang2019logo] 10,000 logo
Table 10: Detailed information about our DomainBank dataset.

D Model architecture

The detailed network architecture for TinyDA dataset is shown in Table 11.

 

layer configuration

 

Feature Generator

 

1 Conv2D (3, 64, 5, 1, 2), BN, ReLU, MaxPool
2 Conv2D (64, 64, 5, 1, 2), BN, ReLU, MaxPool
3 Conv2D (64, 128, 5, 1, 2), BN, ReLU

 

Disentangler

 

1 FC (8192, 3072), BN, ReLU
2 DropOut (0.5), FC (3072, 2048), BN, ReLU

 

Domain Classifier

 

1 FC (2048, 256), LeakyReLU
2 FC (256, 56), LeakyReLU

 

Classifier

 

1 FC (2048, 10 or 26), BN, Softmax

 

Reconstructor

 

1 FC (4096, 8192)

 

Mutual Information Estimator

 

fc1_x FC (2048, 512)
fc1_y FC (2048, 512), LeakyReLU
2 FC (512,1)

 

Table 11: Model architecture for experiments on the TinyDA dataset. For each convolution layer, we list the input dimension, output dimension, kernel size, stride, and padding. For the fully-connected layers, we provide the input and output dimensions. For drop-out layers, we provide the probability of an element being zeroed.
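For convenience, the configuration in Table 11 can be rendered in PyTorch roughly as follows; the code is assembled by us, and the max-pooling kernel size and the 32x32 input resolution are assumptions that make the listed dimensions consistent.

```python
# PyTorch rendering of Table 11 (our assembly; pooling size and 32x32 inputs are assumed).
import torch
import torch.nn as nn

feature_generator = nn.Sequential(
    nn.Conv2d(3, 64, 5, 1, 2), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 64, 5, 1, 2), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 5, 1, 2), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Flatten(),                              # 128 x 8 x 8 = 8192 for 32x32 inputs
)
disentangler = nn.Sequential(                  # one branch; outputs a 2048-d f_d or f_c
    nn.Linear(8192, 3072), nn.BatchNorm1d(3072), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(3072, 2048), nn.BatchNorm1d(2048), nn.ReLU(),
)
domain_classifier = nn.Sequential(
    nn.Linear(2048, 256), nn.LeakyReLU(), nn.Linear(256, 56), nn.LeakyReLU(),
)
category_classifier = nn.Sequential(
    nn.Linear(2048, 10), nn.BatchNorm1d(10), nn.Softmax(dim=1),   # 10 or 26 classes
)
reconstructor = nn.Linear(4096, 8192)          # concat(f_d, f_c) -> reconstructed feature
mi_estimator_fc1_x = nn.Linear(2048, 512)      # Mutual Information Estimator: fc1_x
mi_estimator_fc1_y = nn.Linear(2048, 512)      # fc1_y, followed by LeakyReLU
mi_estimator_out = nn.Linear(512, 1)           # final FC (512, 1)

x = torch.randn(4, 3, 32, 32)
f = feature_generator(x)                       # (4, 8192)
f_branch = disentangler(f)                     # (4, 2048)
```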

E Additional experimental results

(a) t-SNE Plot by BG (b) t-SNE Plot by FG (c) t-SNE Plot by Mode (d) Deep Embedding
Figure 5: Deep domain embedding results of our Domain2Vec model on the TinyDA dataset: (a) t-SNE plot of the embedding result (color indicates different backgrounds); (b) t-SNE plot of the embedding result (color indicates different foregrounds); (c) t-SNE plot of the embedding result (color indicates different modes); (d) deep embedding result. (Best viewed in color. Zoom in to see details.)
(a) t-SNE Plot by Domain (b) Deep Embedding of DomainBank
Figure 6: Deep domain embedding results of our Domain2Vec model on the DomainBank dataset: (a) t-SNE plot of the embedding result (color indicates different domains); (b) deep embedding result. (Best viewed in color. Zoom in to see details.)

F Category information

For the openset domain adaptation experiments in Section 4.3, we choose "aeroplane", "bus", "horse", "motorcycle", "plant", "train", and "truck" as the common categories across the four domains. We set "bicycle", "car", "knife", "person", and "skateboard" as the unknown categories.