Modified Distribution Alignment for Domain Adaptation with Pre-trained Inception ResNet

Deep neural networks have been widely used in computer vision. There are several well-trained deep neural networks for the ImageNet classification challenge, which has played a significant role in image recognition. However, little work has explored pre-trained neural networks for image recognition in domain adaptation. In this paper, we are the first to extract better-represented features from a pre-trained Inception ResNet model for domain adaptation. We then present a modified distribution alignment method for classification using the extracted features. We test our model using three benchmark datasets (Office+Caltech-10, Office-31, and Office-Home). Extensive experiments demonstrate significant improvements (4.8%, 5.5%, and 10%) in accuracy over the state-of-the-art.

I Introduction

With the rapid development of social media and content-sharing applications, data grows much faster than we can make sense of it. There is great demand for automatic classification and analysis of text, images, and other multimedia data [1]. However, it is time-consuming and expensive to acquire enough labeled data to train models. Therefore, it is valuable to learn a model for a new target domain from a different domain with abundant labeled samples. Mechanisms for learning feature representations of a continuous intermediate space from one domain to another have been widely used in many fields such as machine learning [2], language processing [3], and computer vision [4]. There are several techniques to address this problem; a prominent one is domain adaptation [5, 6, 7]. There have been efforts for both semi-supervised [8, 9, 10] and unsupervised [11, 12, 13] domain adaptation. In the first case, the target domain contains a small amount of labeled data; in the latter case, the target domain is entirely unlabeled. Usually the labeled target data alone is insufficient to construct a good classifier. Thus, how to effectively leverage sufficient labeled source data to facilitate unlabeled target data is the key to domain adaptation.

However, a critical challenge remains: identifying useful features that span the representations of the two domains. The quality of such features directly affects classification accuracy. We cannot expect to train a high-quality classifier if the learned features are poor. Therefore, it is essential to find a proper way to represent the source and target data.

One useful working model for feature representation is based on manifold learning, which learns the intermediate features between the source and the target domain via a Grassmannian manifold. Gopalan et al. [4] proposed a sampling geodesic flow (SGF) method to learn the intermediate features between the source and the target domain via the geodesic (shortest path) on the Grassmannian manifold. However, Gong et al. [5] noted that it is difficult to choose an optimal sampling strategy. Moreover, SGF has high time complexity, which makes sampling slow when many points are needed. Gong et al. [5] proposed a geodesic flow kernel (GFK) model to overcome the limitation of the unknown sampling size in SGF. They integrated all samples along the "geodesic", which is calculated as in Gopalan et al. [4]. We show that this "geodesic" is not the true geodesic. Several works have addressed the alignment of the marginal and conditional distributions of data in domain adaptation. Wang and Mahadevan aligned the source and target domains by preserving the "neighborhood structure" of the data points [14]. Wang et al. proposed a manifold embedded distribution alignment method (based on the work of Gong et al. [5]) to address both the degenerate feature transformation and the unevaluated distributions of the two domains [15]. However, none of these models explore the quality of the learned features.

Deep learning models are also widely applied to domain adaptation [7, 16, 17, 18, 19, 20, 21]. Stacked Denoising Autoencoders is one of the first deep models for domain adaptation; it aims to find common features between the source and target domains via denoising autoencoders [22]. Deep neural networks for domain adaptation can be broadly classified into four types: discrepancy-based methods, adversarial discriminative models, adversarial generative models, and data reconstruction-based models. One of the first discrepancy-based methods is Deep Domain Confusion (DDC), which considers the discrepancy in different layers and fine-tunes the network based on the maximum mean discrepancy (MMD) [7]. Later, Long et al. [23] proposed a Deep Adaptation Network (DAN) that considers the sum of MMDs from several layers with several MMD kernel functions. The domain adaptive neural network also embeds MMD as a regularization [24]. Adversarial discriminative models aim to define a domain confusion objective to identify the domains via a domain discriminator. Domain-Adversarial Neural Networks (DANN) use a minimax loss with a gradient reversal layer to promote the discrimination of the source and target domains [25]. Adversarial Discriminative Domain Adaptation (ADDA) uses an inverted-label GAN loss to split the source and target domains, so features can be learned separately [17]. Adversarial generative models combine the discriminative model with generative components based on Generative Adversarial Networks (GANs) [26]. Coupled Generative Adversarial Networks [27] consist of a series of GANs, each of which represents one of the domains. Data reconstruction-based methods jointly learn source label predictions and unsupervised target data reconstruction [28].

However, training deep neural networks consumes time and requires much effort to tune the parameters. We are inspired by Zhang et al. [29], who extracted features from the well-trained AlexNet and then trained an SVM using the deep features to improve classification accuracy. Also, other work indicates that features extracted from the activation layers of a well-trained deep neural network can be re-used for different tasks, even when the new tasks differ from the original tasks used to train the model [30].

In this paper, we first extract features from a well-trained Inception-ResNet-v2 (IR) model; we then classify these features based on a modified distribution alignment. Our contributions are three-fold:

  1. We create three datasets for domain adaptation based on the better extracted features, which can be of significant value to the community in future research.

  2. We show the shortcomings of the original manifold embedded distribution alignment method, and propose a modified distribution alignment for classification, which improves classification accuracy.

  3. We test these improvements using three benchmark datasets. Extensive experiments demonstrate significant improvements (4.8%, 5.5%, and 10%) in classification accuracy over the state-of-the-art.

II Problem Statement

To avoid the complex and time-consuming process of hand-tuning parameters when training a deep neural network, we extract features from a well-trained deep neural network so that we can learn a better feature representation of the source and target domain data. We also want to further align the distributions of the source and target domains.

Given training data (source domain) $\mathcal{D}_s = \{\mathbf{x}_i, y_i\}_{i=1}^{n}$ with its labels $y_i \in \{1, \dots, C\}$ denoting the categories, and test data (target domain) $\mathcal{D}_t = \{\mathbf{x}_j\}_{j=1}^{m}$ with its labels $Y_t$ and $|Y_t| \leq m$, which implies that we might not have all labels for the test data. If $|Y_t| = m$, which means we have sufficient labels for $\mathcal{D}_t$, we aim to get a higher predictive accuracy. If $|Y_t| < m$, we not only want to get a high enough predictive accuracy, but also to predict the labels for the unlabeled data. We have two concerns: 1) how to generate better source and target features for the image recognition problem; 2) how to improve prediction accuracy using the features of step 1.

III Method

Fig. 1: The scheme of the MDAIR model. 1) We first extract features from the last fully connected layer of the Inception-ResNet-v2 model; the learned features are slightly more aligned than the raw features. 2) We then align the distributions of the learned features.

III-A Feature Extraction

Feature extraction is a relatively easy and fast way to take advantage of deep learning without investing time and effort into training a full neural network. Feature extraction is especially useful when no GPU is available, since it only requires a single pass over the input images. Kornblith et al. indicated that ResNets are often the best feature extractors, independently of their ImageNet accuracies [31]. In this paper, we use Inception-ResNet-v2 as the pre-trained model from which to extract features. Inception-ResNet-v2 is a powerful convolutional neural network trained on more than one million images from the ImageNet dataset. The network consists of 164 layers (the largest number of convolutional and fully connected layers from the input layer to the output layer). The IR model can predict 1000 categories of images, such as cups, smart phones, backpacks, and many animals. Therefore, the IR model has learned rich feature representations over a wide range of images. The image input size of the IR model is 299-by-299-by-3. Please refer to [32] for details of the Inception-ResNet-v2 model.

As shown in Fig. 2, we compare the number of parameters and the top-1 accuracy of several well-trained deep neural networks (SqueezeNet [33], AlexNet [34], VGG16 [35], VGG19 [35], GoogLeNet [36], ResNet18 [37], ResNet50, ResNet101, ResNet152 [37], DenseNet201 [38], Inception-v3 [39], and Inception-ResNet-v2 [32]). There are two essential reasons why we choose the IR model as the deep neural network from which to extract features. First, the top-1 accuracy of the Inception-ResNet-v2 model is higher than that of the other models. Second, the IR model uses fewer parameters compared with several lower-accuracy networks (e.g., VGG-16).

We assume that features extracted from the IR model contain more detailed information than other features, which will enable a classifier to achieve higher accuracy. We then compare the extracted IR features with three commonly used sets of features (SURF, ResNet-50, and DeCAF) in Sec. IV-B. In addition, features extracted from different layers have different effects on the final recognition results, which is also shown in Sec. IV-B. Fig. 3 is an example of extracting features using the well-trained Inception-ResNet-v2 model. The left of Fig. 3(a) is the input image; Fig. 3(b) shows the features extracted from the first convolutional layer of the IR model; the right of Fig. 3(a) is the strongest channel feature in Fig. 3(b); and Fig. 3(c) is the feature extracted from the last fully connected layer. Alg. 1 describes the procedure for extracting features from the pre-trained IR model.

Fig. 2: The top-1 accuracy and number of parameters of different pre-trained deep neural networks.
Input: Raw images and the pre-trained Inception-ResNet-v2 model
Output: Extracted features from the IR model
1: Prepare the images (rescale each image to 299 × 299 × 3)
2: Select one layer from which to extract features
3: Feed the raw data points to the IR model and use the activations of the selected layer as the extracted features
Algorithm 1 Extracting features from the IR model
(a) Original image and strongest channel
(b) The activation of first convolutional layer
(c) Final extracted feature
Fig. 3: Original image and extracted features of the first convolutional layer in the Inception-ResNet-v2 model.
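As a concrete illustration of Alg. 1, the following is a minimal sketch of feature extraction using the Keras copy of the pre-trained Inception-ResNet-v2 weights. The layer names "avg_pool" and "predictions" are those of the Keras implementation and are assumptions relative to the layers listed in Tab. II; in Keras the final Dense layer already includes the softmax, so its output only approximates the 1000-dimensional fully connected features described above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras.applications.inception_resnet_v2 import preprocess_input

# Pre-trained IR model with ImageNet weights (input size 299 x 299 x 3).
base = InceptionResNetV2(weights="imagenet")

# Step 2 of Alg. 1: choose the layer to read activations from, e.g. the last
# average pooling layer ("avg_pool", 1536-d) or the final Dense layer
# ("predictions", 1000-d, post-softmax in Keras).
layer_name = "avg_pool"
extractor = tf.keras.Model(base.input, base.get_layer(layer_name).output)

def extract_ir_features(images, batch_size=32):
    """Step 3 of Alg. 1: images is a float array of shape (N, 299, 299, 3)
    with pixel values in [0, 255]; returns an (N, d) feature matrix."""
    x = preprocess_input(np.array(images, dtype=np.float32))
    return extractor.predict(x, batch_size=batch_size)
```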

III-B Distribution Alignment

To train a robust classifier on the features extracted in the previous section, we perform dynamic distribution alignment, which quantitatively accounts for the relative importance of the marginal and conditional distributions, to address the challenge of unevaluated distribution alignment.

Manifold Embedded Distribution Alignment (MEDA) was proposed by Wang et al. [15] to align learned features from manifold learning. It has three fundamental steps: 1) learn features on the manifold following Gong et al. [5]; 2) use dynamic distribution alignment to estimate the marginal and conditional distributions of the data; and 3) update the predicted labels based on the estimated parameters. Please refer to Wang et al. [15] for more details. The classifier $f$ is defined as:

$$ f = \mathop{\arg\min}_{f \in \mathcal{H}_K} \sum_{i=1}^{n} \ell\big(f(g(\mathbf{x}_i)), y_i\big) + \eta \|f\|_K^2 + \lambda \overline{D}_f(\mathcal{D}_s, \mathcal{D}_t) + \rho R_f(\mathcal{D}_s, \mathcal{D}_t) \quad (1) $$

where $\mathcal{H}_K$ represents the kernel Hilbert space; $\ell$ is the loss function; $g(\cdot)$ is a feature learning function in the Grassmannian manifold [5]; $\mathbf{x}_i$ is the learned feature from the IR model; $\|f\|_K^2$ is the squared norm of $f$; $\overline{D}_f(\mathcal{D}_s, \mathcal{D}_t)$ represents the dynamic distribution alignment; $R_f(\mathcal{D}_s, \mathcal{D}_t)$ is a Laplacian regularization; and $\eta$, $\lambda$, and $\rho$ are regularization parameters. Here, the term $\sum_{i=1}^{n} \ell(f(g(\mathbf{x}_i)), y_i) + \eta \|f\|_K^2$ is the structural risk minimization (SRM). We can only employ the SRM on $\mathcal{D}_s$, since there are few labels (perhaps no labels) for $\mathcal{D}_t$. By training the classifier of Eq. 1, we can predict the labels of the test data.
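To make the dynamic distribution alignment term concrete, the following is a simplified sketch under stated assumptions: it uses a plain linear-kernel MMD on the feature matrices and a fixed balance factor mu, whereas MEDA evaluates these discrepancies through the learned classifier $f$ in its kernel space, uses pseudo-labels for the target conditional distributions, and estimates the balance factor from the data.

```python
import numpy as np

def mmd_sq(A, B):
    """Squared linear-kernel MMD between two feature sets (rows are samples)."""
    if len(A) == 0 or len(B) == 0:
        return 0.0
    return float(np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2))

def dynamic_alignment(Xs, Ys, Xt, Yt_pseudo, mu, num_classes):
    """(1 - mu) * marginal discrepancy + mu * sum of per-class conditional
    discrepancies, with mu in [0, 1] and pseudo-labels Yt_pseudo for the
    unlabeled target data."""
    marginal = mmd_sq(Xs, Xt)
    conditional = sum(
        mmd_sq(Xs[Ys == c], Xt[Yt_pseudo == c]) for c in range(num_classes)
    )
    return (1 - mu) * marginal + mu * conditional
```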

III-C Weaknesses of MEDA

The first step of the MEDA method is learning the kernel mapping on the Grassmannian manifold based on the GFK model. However, the calculation of the "geodesic" in the GFK model comes originally from the SGF method, which yields an unevaluated geodesic [4]. GFK then considers all sample points on this "geodesic" to construct a kernel function. This is a "kernel trick", but it cannot maintain the true information from the manifold, since the geodesic is not correctly estimated. We design two experiments to show the defects of GFK.

Given two points on the sphere, we want to recover all other points of the geodesic between them. As shown in Fig. 4, the sampled points of the SGF method (yellow curve) are not able to recover the true points on the geodesic (cyan curve). Therefore, the GFK model will lose feature information if it integrates all pseudo samples from the wrong geodesic (yellow curve) calculated by the SGF method.

Fig. 4: The comparison of SGF samples and the ground truth. The two black points are the given points; the cyan curve highlights the true geodesic points; the yellow curve shows the sampling results of SGF. The sampled points lie away from the true geodesic in the SGF model.
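For reference, the following is a minimal sketch of how the ground-truth geodesic (the cyan curve in Fig. 4) can be generated and used to measure the deviation of sampled points; the function and variable names are illustrative, and the SGF samples themselves would come from the SGF implementation, which is not reproduced here.

```python
import numpy as np

def sphere_geodesic(p, q, num=50):
    """True geodesic (great-circle arc) between unit vectors p and q on the
    sphere, returned as a (num, 3) array of points (the cyan curve)."""
    p = p / np.linalg.norm(p)
    q = q / np.linalg.norm(q)
    theta = np.arccos(np.clip(p @ q, -1.0, 1.0))      # angle between p and q
    t = np.linspace(0.0, 1.0, num)[:, None]
    # Spherical linear interpolation: the exact geodesic on the sphere.
    return (np.sin((1 - t) * theta) * p + np.sin(t * theta) * q) / np.sin(theta)

def distance_to_arc(x, arc):
    """Minimum great-circle distance from a sampled point x to the arc;
    a correct sampler should give values close to zero."""
    x = x / np.linalg.norm(x)
    return float(np.arccos(np.clip(arc @ x, -1.0, 1.0)).min())
```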

We design another experiment to show the shape deformation of the SGF model. As shown in Fig. 5, the source image is a square (the leftmost of Fig. 5(a)), and the target image is a circle (the rightmost of Fig. 5(a)). The progression of sampled images of the SGF model is shown in Fig. 5(b). To evaluate the quality of the samples, there are two criteria: the sample should be similar to the source image when $t = 0$, and the sample should be similar to the target image when $t = 1$. However, the sampled images of the SGF model are far from the source and target images at $t = 0$ and $t = 1$, respectively.

There are two issues in the sampled images of the SGF method. First, the background is dark; this is caused by the Log map not being correctly calculated in Gopalan et al. [4] (there are some negations of the estimated velocity between the source image and the target image). Second, the shape is never unified; this is caused by the Exp map not approaching the target at $t = 1$ [4].

(a) The progress of true samples
(b) The progress of SGF samples
Fig. 5: The comparison of sampling results between the two images (square and circle). Clearly, the SGF model does not generate correct samples in (b). For reference, the source image is at the far left ($t = 0$) and the target image is at the far right ($t = 1$) in Fig. 5(a).
(a) Office+Caltech-10
(b) Office-31
(c) Office-Home
Fig. 6: Some example images from the three benchmark datasets. (a) is from the DSLR domain in the Office+Caltech-10 dataset; (b) is from the Amazon domain in the Office-31 dataset; and (c) is from the Art domain in the Office-Home dataset.

The second shortcoming of GFK is that the subspace dimensionality is difficult to determine. The first step of GFK is to project the original source and target data into a subspace, since the number of instances in the original space is not the same ($n \neq m$). The reduced dimensionality leads to information loss from the original data.

III-D Modified Distribution Alignment

To resolve the issues mentioned above, we use the original features instead of features from the GFK model. There are two essential reasons: 1) we want to maintain the information of the original features and avoid the undetermined dimensionality in the GFK model; 2) the extracted IR features contain enough detailed information for the classification problem (source code is available at: https://github.com/heaventian93/MDAIR). Therefore, we have the following objective function:

$$ f = \mathop{\arg\min}_{f \in \mathcal{H}_K} \sum_{i=1}^{n} \ell\big(f(\mathbf{x}_i), y_i\big) + \eta \|f\|_K^2 + \lambda \overline{D}_f(\mathcal{D}_s, \mathcal{D}_t) + \rho R_f(\mathcal{D}_s, \mathcal{D}_t) \quad (2) $$

where the extracted IR features $\mathbf{x}_i$ are used directly, without the manifold feature learning function $g(\cdot)$ of Eq. 1.

To obtain the modified distribution alignment model, we only need to replace the manifold-learned features in line 1 of Alg. 1 of Wang et al. [15] with our extracted IR features.
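In other words, the pipeline reduces to a two-step sketch like the following, where extract_ir_features() is the sketch from Sec. III-A and meda_solver() is a hypothetical stand-in for a solver of Eq. 2 (for instance, the released MEDA code with its GFK/manifold step disabled); neither name is part of the released implementation.

```python
# Extract IR features for both domains (see the Sec. III-A sketch).
Zs = extract_ir_features(source_images)   # (n, d) source features
Zt = extract_ir_features(target_images)   # (m, d) target features

# Modified alignment (Eq. 2): feed the IR features directly to the
# distribution-alignment classifier instead of GFK-transformed features.
# meda_solver() is a hypothetical stand-in; Ys are the source labels.
Yt_pred = meda_solver(Zs, Ys, Zt)
```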

IV Results

IV-A Description of Datasets

In this experiment, we show how our MDAIR method can enhance image recognition accuracy. We test our model using three public image datasets: Office+Caltech-10 (we combine Office-10 and Caltech-10 into one dataset), Office-31, and Office-Home [40, 15, 41]. These datasets are widely used in many publications [4, 5, 15] and are benchmark data for evaluating the performance of domain adaptation algorithms. Table I lists the statistics of these datasets. In the Office+Caltech-10 and Office-31 datasets, there are four domains in total (A, W, C, and D), where A represents Amazon, W represents Webcam, C represents Caltech, and D represents DSLR. In the Office-Home dataset, A represents Art, C represents Clipart, P represents Product, and R represents Real World. C→A means learning from the existing domain C and transferring knowledge to classify domain A.

Dataset Sample Feature Class Domain(s)
Office-10 1410 1000 10 A, W, D
Caltech-10 1123 1000 10 C
Office-31 1330 1000 31 A, W, D
Office-Home 15588 1000 65 A, C, P, R
TABLE I: Statistics of extracted IR features for four benchmark datasets

Fig. 6 shows example images from the three benchmark datasets. Amazon and Caltech images are mostly from online merchants, while DSLR and Webcam images are mostly from offices [5]. We combine Office-10 and Caltech-10 into one dataset and perform twelve tasks on it: C→A, C→W, ..., D→W. On the Office-31 dataset, we have another six tasks: A→W, A→D, ..., D→W. On the Office-Home dataset, we have another twelve tasks: A→C, A→P, ..., R→P. Therefore, we have a total of 30 tasks in our experiment.

Layers Last average pooling Last fully connected Classification
Feature 1536 1000 1000
TABLE II: Statistics of extracted IR features from different layers.
Fig. 7: Differences in accuracy of features extracted from three layers (average pooling, fully connected, and classification) for the Office+Caltech-10 dataset tasks, where the baseline is the fully connected layer. The accuracy from the fully connected layer is better than from the other layers: all accuracies from the classification layer are below the fully connected layer, and most accuracies from the average pooling layer are below the fully connected layer. Therefore, we suggest extracting features from the last fully connected layer.
Fig. 8: The t-SNE view of the comparison of our IR features (a, d, and g) with DeCAF (b and e), ResNet-50 (h), and SURF features (c and f). Different colors denote different classes. The first row is from the DSLR domain in the Office+Caltech-10 dataset, the second row is from the Webcam domain in the Office-31 dataset, and (g) and (h) are from the Art domain in the Office-Home dataset.
Task C→A C→W C→D A→C A→W A→D W→C W→A W→D D→C D→A D→W Average
TCA 77 80.7 84.7 82.2 68.1 72.6 79.3 86.4 88.5 82.2 86.4 84.7 81.1
ITCA 81 65.8 79.6 82.9 70.8 79 78.2 85.5 92.4 77.9 82.5 90.5 80.5
SSTCA 79.6 70.5 80.9 76.5 72.5 83.4 69.9 79.5 90.4 78.7 85.2 87.8 79.6
TJM 86.7 84.7 86 82.8 78.3 86 82 86 100 83.8 89.6 99.3 87.1
BDA 89.5 78.6 81.5 79.6 73.2 84.7 78.1 83.3 100 79.7 88.5 98.6 84.6
JDA 88.4 84.4 85.4 81.6 80.7 81.5 82.2 89.8 100 86 91.5 99.3 87.6
SVM 91 78 85.4 83.3 72.5 83.4 62.9 72.1 99.4 65 78.2 96.6 80.7
GFK 88.8 77.3 86 77.4 66.8 79 72 76.5 100 75.5 84.7 99 81.9
JGSA 91.4 86.8 93.6 84.9 81.0 88.5 85.0 90.7 100 86.2 92.0 99.7 90.0
ARTL 92.4 87.8 86.6 87.4 88.5 85.4 88.2 92.3 100 87.3 92.7 100 90.7
MEDA 93 91.2 89.8 89 90.8 88.5 89 92.2 99.4 88.6 93.2 98.6 91.9
AlexNet 91.9 83.7 87.1 83 79.5 87.4 73 83.8 100 79 87.1 97.7 86.1
DAN 92 90.6 89.3 84.1 91.8 91.7 81.2 92.1 100 80.3 90 98.5 90.1
DDC 91.9 85.4 88.8 85 86.1 89 78 83.8 100 79 87.1 97.7 86.1
DCORAL 89.8 97.3 91 91.9 100 90.5 83.7 81.5 90.1 88.6 80.1 92.3 89.7
MEDA-IR 96.2 95.9 96.2 95.2 98 96.8 94.5 96.2 99.4 93.8 95.5 98.6 96.4
MDAIR 96.1 94.9 96.2 94.2 98.6 100 94.9 96.3 100 94.2 95.8 98.6 96.7
TABLE III: Accuracy (%) on Office + Caltech-10 datasets
Task TCA SSTCA MEDA DAN RTN DANN ADDA CAN JDDA JAN MEDA-IR MDAIR
A→W 82.6 81 83.3 80.5 84.5 82 86.2 81.5 82.6 85.4 90.8 94
A→D 84.1 78.7 83.3 78.6 77.5 79.7 77.8 65.9 79.8 84.7 91.4 92.6
W→A 69.1 68.9 66.2 62.8 64.8 67.4 68.9 98.2 66.7 70.0 74.6 77.6
W→D 99.6 99.6 96 99.6 99.4 99.1 98.4 85.5 99.7 99.8 97.2 99.2
D→A 66.1 66.6 66.7 63.6 66.2 68.2 69.5 99.7 57.4 68.6 75.4 78.7
D→W 97 97.4 91.7 97.1 96.8 96.9 96.2 63.4 95.2 97.4 96 96.9
Average 83.1 82.0 81.2 80.4 81.6 82.2 82.9 82.4 80.2 84.3 87.5 89.8
TABLE IV: Accuracy (%) on Office-31 datasets
Task A→C A→P A→R C→A C→P C→R P→A P→C P→R R→A R→C R→P Average
AlexNet 26.4 32.6 41.3 22.1 41.7 42.1 20.5 20.3 51.1 31 27.9 54.9 34.3
VGG16 30.4 45.9 57.5 35.4 48.7 50.8 35.8 30.5 60.2 49.6 34.5 64.0 45.3
D-CORAL 32.2 40.5 54.5 31.5 45.8 47.3 30.0 32.3 55.3 44.7 42.8 59.4 42.8
RTN 31.3 40.2 54.6 32.5 46.6 48.3 28.2 32.9 56.4 45.5 44.8 61.3 43.5
DAH 31.6 40.8 51.7 34.7 51.9 52.8 29.9 39.6 60.7 45.0 45.1 62.5 45.5
MDDA 35.2 44.4 57.2 36.8 52.5 53.7 34.8 37.2 62.2 50.0 46.3 66.1 48.0
ResNet-50 34.9 50 58 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1
DAN 43.6 57 67.9 45.8 56.5 60.4 44 43.6 67.7 63.1 51.5 74.3 56.3
DANN 45.6 59.3 70.1 47 58.5 60.9 46.1 43.7 68.5 63.2 51.8 76.8 57.6
JAN 45.9 61.2 68.9 50.4 59.7 61 45.8 43.4 70.3 63.9 52.4 76.8 58.3
CDAN-RM 49.2 64.8 72.9 53.8 62.4 62.9 49.8 48.8 71.5 65.8 56.4 79.2 61.5
CDAN-M 50.6 65.9 73.4 55.7 62.7 64.2 51.8 49.1 74.5 68.2 56.9 80.7 62.8
MEDA-IR 52.9 79.3 78.9 67.3 78.8 78.8 68.2 53.4 79.8 71.8 56.3 83 70.7
MDAIR 55.6 80.4 81.6 70.2 80.7 80.8 71 55.6 82.5 73.5 57.7 83.9 72.8
TABLE V: Accuracy (%) on Office-Home datasets

IV-B Feature Comparison

To determine the best layer for feature extraction, we first explore the effect of different layers on the final accuracy. We list the number of features from three layers in Tab. II. Based on an experiment using the Office+Caltech-10 dataset, we choose the optimal layer. Fig. 7 shows the accuracy of different tasks for the different layers. The accuracy from the fully connected layer is typically higher than from the other two layers. Therefore, we suggest that the last fully connected layer is the best layer from which to extract features in the domain adaptation problem.
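The layer-comparison experiment can be sketched as follows, under stated assumptions: features_by_layer, Ys, and Yt are hypothetical names for the per-layer feature matrices and the source/target labels, and a 1-nearest-neighbor classifier is used here only as a cheap proxy, whereas the paper evaluates the layers with the alignment classifier itself.

```python
from sklearn.neighbors import KNeighborsClassifier

# features_by_layer: {layer name: (source features, target features)}; Ys/Yt
# are the source/target labels (all hypothetical names for this sketch).
for layer, (Zs, Zt) in features_by_layer.items():
    acc = KNeighborsClassifier(n_neighbors=1).fit(Zs, Ys).score(Zt, Yt)
    print(f"{layer}: {acc:.3f}")
```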

We then examine the quality of our IR features from the last fully connected layer. We visualize three domains from the three datasets using the t-SNE technique. t-SNE (t-distributed Stochastic Neighbor Embedding) [42] is an algorithm for visualizing high-dimensional data by re-representing it in a lower-dimensional space. t-SNE generates a low-dimensional representation in which points near each other are similar in the high-dimensional space, and vice versa. The better the clusters are separated in the t-SNE view, the better the extracted features are likely to be. The loss function of the t-SNE method is the Kullback-Leibler divergence, which measures the difference between the two distributions [42]. Typically, lower losses correspond to better features.

As shown in Fig. 8, our extracted IR features produce clearer and better-separated clusters than the SURF, DeCAF, and ResNet-50 features. Therefore, we can expect that our IR features will lead to better classification results than the others. Likewise, the visualizations based on our IR features have the lowest loss values.
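As a minimal sketch of this visualization, assuming a feature matrix and class labels are already in memory, scikit-learn's TSNE can be used; its kl_divergence_ attribute reports the loss referred to above.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_view(features, labels, title):
    """2-D t-SNE embedding of an (N, d) feature matrix, colored by class."""
    tsne = TSNE(n_components=2, init="pca", random_state=0)
    emb = tsne.fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab20")
    plt.title(f"{title} (KL divergence = {tsne.kl_divergence_:.2f})")
    plt.show()

# e.g. tsne_view(ir_features, class_labels, "IR features, DSLR domain")
```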

IV-C Comparison to State-of-the-Art Methods

We compare the performance of our MDAIR model with 25 state-of-the-art (both traditional and deep learning) methods: Transfer Component Analysis (TCA) [8]; Global and Local Metrics for Domain Adaptation (IGLDA, also called ITCA) [19]; Semi-supervised TCA (SSTCA) [8]; Transfer Joint Matching (TJM) [43]; Balanced Distribution Adaptation (BDA) [44]; Joint Distribution Adaptation (JDA) [6]; Support Vector Machine (SVM) [10]; Geodesic Flow Kernel (GFK) [5]; Adaptation Regularization (ARTL) [45]; Joint Geometrical and Statistical Alignment (JGSA) [46]; Manifold Embedded Distribution Alignment (MEDA) [15]; AlexNet [34]; VGG-16 [35]; Deep Adaptation Networks (DAN) [23]; Deep Domain Confusion (DDC) [7]; Deep Correlation Alignment (DCORAL) [16]; Joint Adaptation Networks (JAN) [18]; Residual Transfer Networks (RTN) [47]; Domain Adaptive Neural Networks (DANN) [48]; Domain Adaptive Hashing (DAH) [49]; Minimum Discrepancy Deep Adaptation (MDDA) [41]; Adversarial Discriminative Domain Adaptation (ADDA) [17]; Collaborative Adversarial Network (CAN) [21]; Joint Discriminative Domain Adaptation (JDDA) [20]; and Conditional Domain Adversarial Networks (CDAN-RM, CDAN-M) [50].

From Tables III, IV, and V, we observe that the MDAIR model outperforms all other methods on most tasks (23/30). Notably, our model always achieves the best performance on the Office-Home dataset. Across all three datasets, the overall average performance is significantly improved over the best state-of-the-art baseline methods. The results using SURF features are too low to compare with the DeCAF and IR features and are omitted.

To illustrate the effectiveness of our model, we consider the case in which all models use our IR features, and view the prediction results using t-SNE. Focusing on the A→D task, in which the accuracy of our MDAIR is 100% (and thus identical to the ground truth), Fig. 9 shows that all other conventional methods contain mixed colors in the t-SNE view. These results indicate that our modified distribution alignment is better than several baseline methods, even when using the same features. In addition, we test our IR features using the original MEDA method (MEDA-IR in Tables III, IV, and V); the results again show that our modified distribution alignment is better than the previous MEDA model.

Fig. 9: T-SNE view of the comparison of baseline methods and the proposed MDAIR model on the A→D task in the Office+Caltech-10 dataset. The proposed MDAIR model has the highest accuracy, while all other methods show some mixed colors, which implies that some classes are wrongly classified (as colors correspond to labels).

IV-D Parameter Settings

In our experiments, the optimal parameters for different tasks might be different. To more easily reproduce our results, we use consistent parameters: , , , and .

V Discussion

Task Best baseline MDAIR Improvement
Office+Caltech-10 91.9 96.7 4.8%
Office-31 84.3 89.8 5.5%
Office-Home 62.8 72.8 10%
TABLE VI: Comparison of average accuracy of the best baseline method and our MDAIR model

We list the improvement of our model over the best state-of-the-art methods in Table VI. For the three datasets (Office+Caltech-10, Office-31, and Office-Home), our method improves the absolute accuracy by 4.8%, 5.5%, and 10%, respectively. Therefore, our model exceeds all of the state-of-the-art methods considered.

There are two prominent reasons for the success of our model. First, our model takes advantage of deep features from the Inception-ResNet-v2 model, which produces better features than SURF and DeCAF; better features reduce the difference between the source and target domains. Second, the modified distribution alignment facilitates the alignment of the feature distributions, which leads to higher accuracy.

In addition, our experiments imply that the last fully connected layer is the best layer for feature extraction. A likely reason is that this layer collects all features from the previous layers; hence it forms better features than earlier layers. Although the last classification layer can also be used for feature extraction from the IR model, its performance is worse than the last fully connected layer, since features from the classification layer are affected by the original training classes. We observe that our model is compromised in some tasks (A→W in Office+Caltech-10 and D→A in the Office-31 dataset). This is caused by the intrinsic differences among the various datasets, so we cannot guarantee that our model always beats all other methods.

However, one weakness of our model is that the choice of feature extraction affects the results significantly. We suggest that extracting features from deep neural networks with higher top-1 accuracy could further improve the accuracy.

VI Conclusion

In this paper, we are the first to extract features from a pre-trained Inception-ResNet-v2 model for the domain adaptation problem. Our experiments show that the last fully connected layer is the best layer from which to extract features, and that the extracted features are better than the DeCAF and ResNet-50 features. The modified distribution alignment model performs better than other models. We test our model using three benchmark datasets, and extensive experiments demonstrate significant improvements in classification accuracy over the state-of-the-art.

There are some obvious areas for follow-up work. Extracting features from another well-trained deep neural network might generate better input for the modified distribution alignment than the IR model. Testing on a broader set of unsupervised learning tasks would improve the applicability of our model. Also, a new distribution alignment method could be beneficial for increasing the predictive accuracy.

References

  • [1] Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, and Shih-Fu Chang. Zero-shot visual recognition using semantics-preserving adversarial embedding network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2018.
  • [2] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144, 2007.
  • [3] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proc. 45th Annual Meeting of the Assoc. of Computational Linguistics, pages 440–447, 2007.
  • [4] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In IEEE International Conference on Computer Vision (ICCV), pages 999–1006. IEEE, 2011.
  • [5] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2066–2073. IEEE, 2012.
  • [6] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2200–2207, 2013.
  • [7] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [8] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Trans. on Neural Networks, 22(2):199–210, 2011.
  • [9] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
  • [10] Alessandro Bergamo and Lorenzo Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems, pages 181–189, 2010.
  • [11] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, pages 129–136, 2008.
  • [12] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
  • [13] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics, pages 129–136, 2010.
  • [14] Chang Wang and Sridhar Mahadevan. Manifold alignment without correspondence. In IJCAI, volume 2, pages 1273–1278, 2009.
  • [15] Jindong Wang, Wenjie Feng, Yiqiang Chen, Han Yu, Meiyu Huang, and Philip S Yu. Visual domain adaptation with manifold embedded distribution alignment. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 402–410. ACM, 2018.
  • [16] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
  • [17] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
  • [18] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2208–2217. JMLR.org, 2017.
  • [19] Min Jiang, Wenzhen Huang, Zhongqiang Huang, and Gary G Yen. Integration of global and local metrics for domain adaptation learning via dimensionality reduction. IEEE transactions on cybernetics, 47(1):38–51, 2017.
  • [20] Chao Chen, Zhihong Chen, Boyuan Jiang, and Xinyu Jin. Joint domain alignment and discriminative feature learning for unsupervised deep domain adaptation. arXiv preprint arXiv:1808.09347, 2018.
  • [21] Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3801–3809, 2018.
  • [22] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
  • [23] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
  • [24] Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision, pages 2551–2559, 2015.
  • [25] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [26] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [27] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in neural information processing systems, pages 469–477, 2016.
  • [28] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
  • [29] Youshan Zhang, Jon-Patrick Allem, Jennifer Beth Unger, and Tess Boley Cruz. Automated identification of hookahs (waterpipes) on instagram: an application in feature extraction using convolutional neural network and support vector machine classification. Journal of medical Internet research, 20(11), 2018.
  • [30] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
  • [31] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? arXiv preprint arXiv:1805.08974, 2018.
  • [32] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [33] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
  • [34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [36] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [38] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [39] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [40] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision, pages 213–226. Springer, 2010.
  • [41] Mohammad Mahfujur Rahman, Clinton Fookes, Mahsa Baktashmotlagh, and Sridha Sridharan. On minimum discrepancy estimation for deep domain adaptation. arXiv preprint arXiv:1901.00282, 2019.
  • [42] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [43] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer joint matching for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1417, 2014.
  • [44] Jindong Wang, Yiqiang Chen, Shuji Hao, Wenjie Feng, and Zhiqi Shen. Balanced distribution adaptation for transfer learning. In Data Mining (ICDM), 2017 IEEE International Conference on, pages 1129–1134. IEEE, 2017.
  • [45] Mingsheng Long, Jianmin Wang, Guiguang Ding, Sinno Jialin Pan, and S Yu Philip. Adaptation regularization: A general framework for transfer learning. IEEE Transactions on Knowledge and Data Engineering, 26(5):1076–1089, 2014.
  • [46] Jing Zhang, Wanqing Li, and Philip Ogunbona. Joint geometrical and statistical alignment for visual domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1859–1867, 2017.
  • [47] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.
  • [48] Muhammad Ghifary, W Bastiaan Kleijn, and Mengjie Zhang. Domain adaptive neural networks for object recognition. In Pacific Rim international conference on artificial intelligence, pages 898–904. Springer, 2014.
  • [49] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
  • [50] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 1647–1657, 2018.