MDAIR
Deep neural networks have been widely used in computer vision. There are several well-trained deep neural networks for the ImageNet classification challenge, which has played a significant role in image recognition. However, little work has explored pre-trained neural networks for image recognition in domain adaptation. In this paper, we are the first to extract better-represented features from a pre-trained Inception-ResNet model for domain adaptation. We then present a modified distribution alignment method for classification using the extracted features. We test our model using three benchmark datasets (Office+Caltech-10, Office-31, and Office-Home). Extensive experiments demonstrate significant improvements (4.8%, 5.5%, and 10%) in classification accuracy over the state-of-the-art.
With the rapid development of social media and content-sharing applications, data grows much faster than we can make sense of it. There is great demand for automatic classification and analysis of text, images, and other multimedia data [1]. However, it is time-consuming and expensive to acquire enough labeled data to train models. Therefore, it is valuable to learn a model for a new target domain from a different domain with abundant labeled samples. Mechanisms for learning feature representations of a continuous intermediate space from one domain to another have been widely used in many fields such as machine learning [2], language processing [3], and computer vision [4]. There are several techniques to address this problem; a prominent one is domain adaptation [5, 6, 7]. There have been efforts for both semi-supervised [8, 9, 10] and unsupervised [11, 12, 13] domain adaptation. In the first case, the target domain contains a small amount of labeled data; in the latter case, the target domain is entirely unlabeled. Usually the labeled target data alone is insufficient to construct a good classifier. Thus, how to effectively leverage abundant labeled source data to facilitate learning on unlabeled target data is the key to domain adaptation.
However, a critical challenge remains: identifying useful features that span the representations of the two domains. The quality of such features directly affects classification accuracy; we cannot expect to train a high-quality classifier if the learned features are poor. Therefore, it is essential to find a proper way to represent the source and target data.
One useful working model for feature representation is based on manifold learning, which learns intermediate features between the source and the target domain via a Grassmannian manifold. Gopalan et al. [4] proposed a sampling geodesic flow (SGF) method to learn the intermediate features between the source and the target domain via the geodesic (shortest path) on the Grassmannian manifold. However, Gong et al. [5] noted that it is difficult to choose an optimal sampling strategy. Moreover, SGF has high time complexity, making sampling slow when many points are needed. Gong et al. [5] proposed a geodesic flow kernel (GFK) model to overcome the limitation of the unknown sampling size in SGF. They integrated all samples along the "geodesic" calculated by Gopalan et al. [4]. We show that this "geodesic" is not the true geodesic. Several works have addressed the alignment of the marginal and conditional distributions of data in domain adaptation. Wang and Mahadevan aligned the source and target domains by preserving the 'neighborhood structure' of the data points [14]. Wang et al. proposed a manifold embedded distribution alignment method (based on the work of Gong et al. [5]) to align both the degenerate feature transformation and the unevaluated distributions of both domains [15]. However, none of these models explores the quality of the learned features.
Deep learning models are also widely applied to domain adaptation [7, 16, 17, 18, 19, 20, 21]. The Stacked Denoising Autoencoder was one of the first deep models for domain adaptation; it aims to find common features between the source and target domains via denoising autoencoders [22]. Deep neural networks for domain adaptation can be broadly classified into four types: discrepancy-based methods, adversarial discriminative models, adversarial generative models, and data reconstruction-based models. One of the first discrepancy-based methods is Deep Domain Confusion (DDC), which measures the discrepancy at different layers and fine-tunes the network based on the maximum mean discrepancy (MMD) [7]. Later, Long et al. [23] proposed the Deep Adaptation Network (DAN), which sums the MMD over several layers using multiple MMD kernels. The domain adaptive neural network also embeds MMD as a regularizer [24]. Adversarial discriminative models define a domain-confusion objective and identify the domains via a domain discriminator. Domain-Adversarial Neural Networks (DANN) use a minimax loss with a gradient reversal layer to promote discrimination between the source and target domains [25]. Adversarial Discriminative Domain Adaptation (ADDA) uses an inverted-label GAN loss to split the source and target domains so that features can be learned separately [17]. Adversarial generative models combine a discriminative model with generative components based on Generative Adversarial Networks (GANs) [26]. Coupled Generative Adversarial Networks [27] consist of a series of GANs, each representing one of the domains. Data reconstruction-based methods jointly learn source label prediction and unsupervised target data reconstruction [28]. However, training deep neural networks is time-consuming and requires much effort to tune parameters. We are inspired by Zhang et al. [29], who extracted features from the well-trained AlexNet and then trained an SVM on the deep features to improve classification accuracy. Other work has indicated that features extracted from the activation layers of a well-trained deep neural network can be re-used for different tasks, even when the new tasks differ from the original task used to train the model [30]. In this paper, we first extract features from a well-trained Inception-ResNet-v2 (IR) model; we then classify these features based on a modified distribution alignment. Our contributions are three-fold:
We create three datasets for domain adaptation based on better extracted features, which can be of significant value for future research in the community.
We show the shortcomings of the original manifold embedded distribution alignment method, and propose a modified distribution alignment for classification, which improves classification accuracy.
We test these improvements using three benchmark datasets. Extensive experiments demonstrate significant improvements (4.8%, 5.5%, and 10%) in classification accuracy over the state-of-the-art.
To avoid the complex and time-consuming process of hand-tuning parameters when training a deep neural network, we extract features from a well-trained deep neural network, allowing us to learn a better feature representation of the source and target domain data. We also further align the distributions of the source and target domains.
Given training data (source domain) $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, with labels $y^s \in \{1, \dots, C\}$ denoting the categories, and test data (target domain) $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ with labels $y^t$ available for only $n_l \le n_t$ of the samples; that is, we might not have all labels for the test data. If $n_l = n_t$, which means we have sufficient labels for $\mathcal{D}_t$, we aim to get a higher predictive accuracy. If $n_l < n_t$, we not only want to get a high enough predictive accuracy, but also to predict the labels for the unlabeled data. We have two concerns: 1) how to generate better source and target features for the image recognition problem; 2) how to improve prediction accuracy using the features of step 1.
Feature extraction is a relatively easy and fast way to take advantage of deep learning without investing time and much effort into training a full neural network. Feature extraction will be especially useful if we do not use GPUs since it only requires a single pass over the input images. Kornblith et al. indicated that ResNets are often the best feature extractors, independently of their ImageNet accuracies [31]
. In this paper, we use Inception-ResNet-v2 as the pre-trained model from which to extract features. Inception-ResNet-v2 is a powerful convolutional neural network trained on more than one million images from the ImageNet dataset. The network is 164 layers deep (counting the largest number of convolutional and fully connected layers on a path from the input layer to the output layer). The IR model can predict 1000 categories of images, such as cup, smartphone, backpack, and many animals. Therefore, the IR model has learned rich feature representations for a wide range of images. The image input size of the IR model is 299-by-299-by-3. Please refer to [32] for details of the Inception-ResNet-v2 model. As shown in Fig. 2, we compare the number of parameters and the top-1 accuracy of several well-trained deep neural networks (SqueezeNet [33], AlexNet [34], VGG16 [35], VGG19 [35], GoogLeNet [36], ResNet18 [37], ResNet50, ResNet101, ResNet152 [37], DenseNet201 [38], Inception-v3 [39], and Inception-ResNet-v2 [32]). There are two essential reasons why we choose the IR model as the deep neural network from which to extract features. First, the top-1 accuracy of the Inception-ResNet-v2 model is higher than that of the other models. Second, the IR model uses fewer parameters than several lower-accuracy networks (e.g., VGG-16).
We assume that the features extracted from the IR model contain more detailed information than other features, which enables a classifier to achieve higher accuracy. We compare the extracted IR features with three commonly used sets of features (SURF, ResNet-50, and DeCAF) in Sec. IV-B. In addition, features extracted from different layers have different effects on the final recognition results, which is also shown in Sec. IV-B. Fig. 3 is an example of extracting features using the well-trained Inception-ResNet-v2 model: the left of Fig. 3(a) is the input image; Fig. 3(b) shows the features extracted from the first convolutional layer of the IR model; the right of Fig. 3(a) shows the strongest channel in Fig. 3(b); and Fig. 3(c) shows the features extracted from the last fully connected layer. Alg. 1 describes the procedure for extracting features from the pre-trained IR model.
Apply the raw image of a data point as input, and use the activations of the selected layer of the IR model to extract its feature vector.
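As a concrete illustration of Alg. 1, the sketch below extracts 1000-dimensional features from the last fully connected layer using the Keras implementation of Inception-ResNet-v2. This is a minimal sketch rather than the authors' pipeline (the paper appears to use a different framework); the layer name `predictions` and the `classifier_activation=None` argument follow the Keras implementation and are assumptions here.

```python
import numpy as np
import tensorflow as tf

# Pre-trained Inception-ResNet-v2; classifier_activation=None keeps the raw
# outputs of the last fully connected layer instead of softmax probabilities.
base = tf.keras.applications.InceptionResNetV2(weights="imagenet",
                                               classifier_activation=None)
# 1000-d output of the last fully connected layer ("predictions" in Keras).
extractor = tf.keras.Model(base.input, base.get_layer("predictions").output)

def extract_ir_features(image_paths, batch_size=32):
    """Return an (N, 1000) array of IR features for a list of image files."""
    feats = []
    for start in range(0, len(image_paths), batch_size):
        batch = []
        for path in image_paths[start:start + batch_size]:
            img = tf.keras.utils.load_img(path, target_size=(299, 299))  # IR input size
            batch.append(tf.keras.utils.img_to_array(img))
        x = tf.keras.applications.inception_resnet_v2.preprocess_input(np.stack(batch))
        feats.append(extractor.predict(x, verbose=0))
    return np.concatenate(feats, axis=0)
```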
To train a robust classifier on the features extracted in the previous section, we perform dynamic distribution alignment, which quantitatively accounts for the relative importance of the marginal and conditional distributions and thus addresses the challenge of unevaluated distribution alignment.
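For intuition, the sketch below illustrates the kind of dynamic (marginal plus conditional) discrepancy term involved, using a simplified linear-kernel MMD and pseudo-labels for the target data; it is only an illustration of the idea, not the authors' implementation (the classifier-induced, kernelized form appears in Eq. 1 below).

```python
import numpy as np

def mmd_linear(Xs, Xt):
    """Squared MMD with a linear kernel: ||mean(Xs) - mean(Xt)||^2."""
    delta = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(delta @ delta)

def dynamic_alignment(Xs, Ys, Xt, Yt_pseudo, mu):
    """(1 - mu) * marginal discrepancy + mu * sum of per-class (conditional) discrepancies."""
    marginal = mmd_linear(Xs, Xt)
    conditional = 0.0
    for c in np.intersect1d(Ys, Yt_pseudo):  # classes present in both domains
        conditional += mmd_linear(Xs[Ys == c], Xt[Yt_pseudo == c])
    return (1.0 - mu) * marginal + mu * conditional
```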
Manifold Embedded Distribution Alignment (MEDA) was proposed by Wang et al. [15] to align features learned from manifold learning. It has three fundamental steps: 1) learn features from the manifold based on Gong et al. [5]; 2) use dynamic distribution alignment to estimate the marginal and conditional distributions of the data; and 3) update the predicted labels based on the estimated parameters. Please refer to Wang et al.
[15] for more details. The classifier $f$ is defined as

$$
f = \mathop{\arg\min}_{f \in \mathcal{H}_K} \sum_{i=1}^{n_s} \ell\big(f(g(x_i)), y_i\big) + \eta \lVert f \rVert_K^2 + \lambda \overline{D_f}(\mathcal{D}_s, \mathcal{D}_t) + \rho R_f(\mathcal{D}_s, \mathcal{D}_t), \tag{1}
$$

where $\mathcal{H}_K$ is the kernel Hilbert space; $\ell(\cdot,\cdot)$ is the loss function; $g(\cdot)$ is the feature learning function on the Grassmannian manifold [5]; $x_i$ is the learned feature from the IR model; $\lVert f \rVert_K^2$ is the squared norm of $f$; $\overline{D_f}(\cdot,\cdot)$ represents the dynamic distribution alignment; $R_f(\cdot,\cdot)$ is a Laplacian regularization; and $\eta$, $\lambda$, and $\rho$ are regularization parameters. Here, the first two terms constitute the structural risk minimization (SRM). We can only apply the SRM to $\mathcal{D}_s$, since there are few labels (perhaps no labels) for $\mathcal{D}_t$. By training the classifier from Eq. 1, we can predict the labels of the test data. The first step of the MEDA method learns the kernel mapping on the Grassmannian manifold using the GFK model. However, the calculation of the "geodesic" in the GFK model originates from the SGF method, and that geodesic is not evaluated correctly [4]. GFK then integrates all sample points along this "geodesic" to construct a kernel function. This is a "kernel trick", but it cannot preserve the true information of the manifold since the geodesic is not correctly estimated. We design two experiments to show the defects of GFK.
Given two points $p_1$ and $p_2$ on the sphere, we want to recover all the points between them. As shown in Fig. 4, the sampled points of the SGF method (yellow curve) are not able to recover the true points on the geodesic (cyan curve). Therefore, the GFK model loses feature information when it integrates all pseudo-samples from the wrong geodesic (yellow curve) calculated by the SGF method.
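The true geodesic between two points on the unit sphere is the great-circle arc, which has a closed form (spherical linear interpolation); a minimal sketch for generating it, against which the SGF samples can be compared, is shown below. This is our illustration, not the SGF code.

```python
import numpy as np

def sphere_geodesic(p1, p2, num=20):
    """Points on the great-circle arc (true geodesic) from p1 to p2 on the unit sphere."""
    p1, p2 = p1 / np.linalg.norm(p1), p2 / np.linalg.norm(p2)
    theta = np.arccos(np.clip(p1 @ p2, -1.0, 1.0))  # angle between the two points
    t = np.linspace(0.0, 1.0, num)[:, None]
    if theta < 1e-12:                 # p1 and p2 coincide; the path is constant
        return np.repeat(p1[None, :], num, axis=0)
    # Spherical linear interpolation (slerp) traces the geodesic exactly.
    return (np.sin((1 - t) * theta) * p1 + np.sin(t * theta) * p2) / np.sin(theta)
```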
We design another experiment to show shape deformation using the SGF model. As shown in Fig. 5, the source image is a square (the leftmost image of Fig. 5(a)), and the target image is a circle (the rightmost image of Fig. 5(a)). The sequence of images sampled by the SGF model is shown in Fig. 5(b). To evaluate the quality of the samples, there are two criteria: the sample should be similar to the source image at the start of the path, and similar to the target image at the end of the path. However, the sampled images of the SGF model are far from the source and target images at the start and the end, respectively.
There are two issues with the sampled images of the SGF method. First, the background is dark; this is caused by the Log map not being correctly calculated in Gopalan et al. [4] (there are some negations of the estimated velocity between the source image and the target image). Second, the shape never converges to the target; this is caused by the Exp map not reaching the target at the end of the path [4].
The second shortcoming of GFK is that the subspace dimensionality is difficult to determine. The first step of GFK is to project the original source and target data into a subspace, since the numbers of instances in the original spaces are not the same ($n_s \neq n_t$). The reduced dimensionality leads to information loss from the original data.
To resolve the issues mentioned above, we use the original features instead of the features from the GFK model. There are two essential reasons: 1) we want to maintain the information in the original features and avoid the undetermined dimensionality of the GFK model; 2) the extracted IR features contain enough detailed information for the classification problem (source code is available at https://github.com/heaventian93/MDAIR). Therefore, we have the following objective function:
$$
f = \mathop{\arg\min}_{f \in \mathcal{H}_K} \sum_{i=1}^{n_s} \ell\big(f(x_i), y_i\big) + \eta \lVert f \rVert_K^2 + \lambda \overline{D_f}(\mathcal{D}_s, \mathcal{D}_t) + \rho R_f(\mathcal{D}_s, \mathcal{D}_t). \tag{2}
$$
We only need to replace the manifold-learning features in line 1 of Alg. 1 of Wang et al. [15] with our extracted IR features to obtain the modified distribution alignment model.
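In code, this modification amounts to skipping the GFK step; the sketch below assumes a MEDA-style routine named `meda_fit_predict` (a hypothetical stand-in for the released MEDA implementation) whose remaining steps are unchanged.

```python
def mdair(Xs, Ys, Xt, meda_fit_predict):
    """Xs: (n_s, 1000) source IR features, Ys: source labels, Xt: (n_t, 1000) target IR features."""
    # Original MEDA, step 1: Zs, Zt = gfk(Xs, Xt)  -- skipped in MDAIR;
    # the extracted IR features are used directly (Eq. 2).
    Zs, Zt = Xs, Xt
    # Steps 2-3 (dynamic distribution alignment and label refinement) are unchanged.
    return meda_fit_predict(Zs, Ys, Zt)
```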
In this experiment, we show how our MDAIR method enhances image recognition accuracy. We test our model using three public image datasets: Office+Caltech-10 (we combine Office-10 and Caltech-10 into one dataset), Office-31, and Office-Home [40, 15, 41]. These datasets are widely used in many publications [4, 5, 15], and are the benchmark data for evaluating the performance of domain adaptation algorithms. Table I lists the statistics of these datasets. In the Office+Caltech-10 and Office-31 datasets, there are four domains (A, W, C, and D), where A represents Amazon, W represents Webcam, C represents Caltech, and D represents DSLR. In the Office-Home dataset, A represents Art, C represents Clipart, P represents Product, and R represents Real World. C→A means learning from the existing domain C and transferring knowledge to classify domain A.
Dataset | #Samples | #Features | #Classes | Domain(s)
---|---|---|---|---|
Office-10 | 1410 | 1000 | 10 | A, W, D |
Caltech-10 | 1123 | 1000 | 10 | C |
Office-31 | 1330 | 1000 | 31 | A, W, D |
Office-Home | 15588 | 1000 | 65 | A, C, P, R |
Fig. 6 shows example images from the three benchmark datasets. Amazon and Caltech images are mostly from online merchants, while DSLR and Webcam images are mostly taken in office environments [5]. We combine Office-10 and Caltech-10 into one dataset and perform twelve tasks on it: C→A, C→W, …, D→W. On the Office-31 dataset, we have another six tasks: A→W, A→D, …, D→W. For the Office-Home dataset, we have another twelve tasks: A→C, A→P, …, R→P. Therefore, we have a total of 30 tasks in our experiment.
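The full task list follows directly from these domain sets; a small sketch enumerating all 30 source-to-target pairs (using the abbreviations defined above) is:

```python
from itertools import permutations

office_caltech = ["C", "A", "W", "D"]   # 4 domains -> 12 tasks
office_31      = ["A", "W", "D"]        # 3 domains -> 6 tasks
office_home    = ["A", "C", "P", "R"]   # 4 domains -> 12 tasks

tasks = [f"{s}->{t}"
         for domains in (office_caltech, office_31, office_home)
         for s, t in permutations(domains, 2)]
assert len(tasks) == 30  # 12 + 6 + 12 tasks in total
```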
Layer | Last average pooling | Last fully connected | Classification
---|---|---|---
#Features | 1536 | 1000 | 1000
Task | C→A | C→W | C→D | A→C | A→W | A→D | W→C | W→A | W→D | D→C | D→A | D→W | Average
---|---|---|---|---|---|---|---|---|---|---|---|---|---
TCA | 77 | 80.7 | 84.7 | 82.2 | 68.1 | 72.6 | 79.3 | 86.4 | 88.5 | 82.2 | 86.4 | 84.7 | 81.1 |
ITCA | 81 | 65.8 | 79.6 | 82.9 | 70.8 | 79 | 78.2 | 85.5 | 92.4 | 77.9 | 82.5 | 90.5 | 80.5 |
SSTCA | 79.6 | 70.5 | 80.9 | 76.5 | 72.5 | 83.4 | 69.9 | 79.5 | 90.4 | 78.7 | 85.2 | 87.8 | 79.6 |
TJM | 86.7 | 84.7 | 86 | 82.8 | 78.3 | 86 | 82 | 86 | 100 | 83.8 | 89.6 | 99.3 | 87.1 |
BDA | 89.5 | 78.6 | 81.5 | 79.6 | 73.2 | 84.7 | 78.1 | 83.3 | 100 | 79.7 | 88.5 | 98.6 | 84.6 |
JDA | 88.4 | 84.4 | 85.4 | 81.6 | 80.7 | 81.5 | 82.2 | 89.8 | 100 | 86 | 91.5 | 99.3 | 87.6 |
SVM | 91 | 78 | 85.4 | 83.3 | 72.5 | 83.4 | 62.9 | 72.1 | 99.4 | 65 | 78.2 | 96.6 | 80.7 |
GFK | 88.8 | 77.3 | 86 | 77.4 | 66.8 | 79 | 72 | 76.5 | 100 | 75.5 | 84.7 | 99 | 81.9 |
JGSA | 91.4 | 86.8 | 93.6 | 84.9 | 81.0 | 88.5 | 85.0 | 90.7 | 100 | 86.2 | 92.0 | 99.7 | 90.0 |
ARTL | 92.4 | 87.8 | 86.6 | 87.4 | 88.5 | 85.4 | 88.2 | 92.3 | 100 | 87.3 | 92.7 | 100 | 90.7 |
MEDA | 93 | 91.2 | 89.8 | 89 | 90.8 | 88.5 | 89 | 92.2 | 99.4 | 88.6 | 93.2 | 98.6 | 91.9 |
AlexNet | 91.9 | 83.7 | 87.1 | 83 | 79.5 | 87.4 | 73 | 83.8 | 100 | 79 | 87.1 | 97.7 | 86.1 |
DAN | 92 | 90.6 | 89.3 | 84.1 | 91.8 | 91.7 | 81.2 | 92.1 | 100 | 80.3 | 90 | 98.5 | 90.1 |
DDC | 91.9 | 85.4 | 88.8 | 85 | 86.1 | 89 | 78 | 83.8 | 100 | 79 | 87.1 | 97.7 | 86.1 |
DCORAL | 89.8 | 97.3 | 91 | 91.9 | 100 | 90.5 | 83.7 | 81.5 | 90.1 | 88.6 | 80.1 | 92.3 | 89.7 |
MEDA-IR | 96.2 | 95.9 | 96.2 | 95.2 | 98 | 96.8 | 94.5 | 96.2 | 99.4 | 93.8 | 95.5 | 98.6 | 96.4 |
MDAIR | 96.1 | 94.9 | 96.2 | 94.2 | 98.6 | 100 | 94.9 | 96.3 | 100 | 94.2 | 95.8 | 98.6 | 96.7 |
Task | TCA | SSTCA | MEDA | DAN | RTN | DANN | ADDA | CAN | JDDA | JAN | MEDA-IR | MDAIR
---|---|---|---|---|---|---|---|---|---|---|---|---
A→W | 82.6 | 81 | 83.3 | 80.5 | 84.5 | 82 | 86.2 | 81.5 | 82.6 | 85.4 | 90.8 | 94
A→D | 84.1 | 78.7 | 83.3 | 78.6 | 77.5 | 79.7 | 77.8 | 65.9 | 79.8 | 84.7 | 91.4 | 92.6
W→A | 69.1 | 68.9 | 66.2 | 62.8 | 64.8 | 67.4 | 68.9 | 98.2 | 66.7 | 70.0 | 74.6 | 77.6
W→D | 99.6 | 99.6 | 96 | 99.6 | 99.4 | 99.1 | 98.4 | 85.5 | 99.7 | 99.8 | 97.2 | 99.2
D→A | 66.1 | 66.6 | 66.7 | 63.6 | 66.2 | 68.2 | 69.5 | 99.7 | 57.4 | 68.6 | 75.4 | 78.7
D→W | 97 | 97.4 | 91.7 | 97.1 | 96.8 | 96.9 | 96.2 | 63.4 | 95.2 | 97.4 | 96 | 96.9
Average | 83.1 | 82.0 | 81.2 | 80.4 | 81.6 | 82.2 | 82.9 | 82.4 | 80.2 | 84.3 | 87.5 | 89.8 |
Task | A→C | A→P | A→R | C→A | C→P | C→R | P→A | P→C | P→R | R→A | R→C | R→P | Average
---|---|---|---|---|---|---|---|---|---|---|---|---|---
AlexNet | 26.4 | 32.6 | 41.3 | 22.1 | 41.7 | 42.1 | 20.5 | 20.3 | 51.1 | 31 | 27.9 | 54.9 | 34.3 |
VGG16 | 30.4 | 45.9 | 57.5 | 35.4 | 48.7 | 50.8 | 35.8 | 30.5 | 60.2 | 49.6 | 34.5 | 64.0 | 45.3 |
D-CORAL | 32.2 | 40.5 | 54.5 | 31.5 | 45.8 | 47.3 | 30.0 | 32.3 | 55.3 | 44.7 | 42.8 | 59.4 | 42.8 |
RTN | 31.3 | 40.2 | 54.6 | 32.5 | 46.6 | 48.3 | 28.2 | 32.9 | 56.4 | 45.5 | 44.8 | 61.3 | 43.5 |
DAH | 31.6 | 40.8 | 51.7 | 34.7 | 51.9 | 52.8 | 29.9 | 39.6 | 60.7 | 45.0 | 45.1 | 62.5 | 45.5 |
MDDA | 35.2 | 44.4 | 57.2 | 36.8 | 52.5 | 53.7 | 34.8 | 37.2 | 62.2 | 50.0 | 46.3 | 66.1 | 48.0 |
ResNet-50 | 34.9 | 50 | 58 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1 |
DAN | 43.6 | 57 | 67.9 | 45.8 | 56.5 | 60.4 | 44 | 43.6 | 67.7 | 63.1 | 51.5 | 74.3 | 56.3 |
DANN | 45.6 | 59.3 | 70.1 | 47 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
JAN | 45.9 | 61.2 | 68.9 | 50.4 | 59.7 | 61 | 45.8 | 43.4 | 70.3 | 63.9 | 52.4 | 76.8 | 58.3 |
CDAN-RM | 49.2 | 64.8 | 72.9 | 53.8 | 62.4 | 62.9 | 49.8 | 48.8 | 71.5 | 65.8 | 56.4 | 79.2 | 61.5 |
CDAN-M | 50.6 | 65.9 | 73.4 | 55.7 | 62.7 | 64.2 | 51.8 | 49.1 | 74.5 | 68.2 | 56.9 | 80.7 | 62.8 |
MEDA-IR | 52.9 | 79.3 | 78.9 | 67.3 | 78.8 | 78.8 | 68.2 | 53.4 | 79.8 | 71.8 | 56.3 | 83 | 70.7 |
MDAIR | 55.6 | 80.4 | 81.6 | 70.2 | 80.7 | 80.8 | 71 | 55.6 | 82.5 | 73.5 | 57.7 | 83.9 | 72.8 |
To determine the best layer for feature extraction, we first explore the effect of different layers on the final accuracy. We list the number of features from three layers in Tab. II. Based on an experiment using the Office+Caltech-10 dataset, we choose the optimal layer. Fig. 7 shows the accuracy of different tasks for the different layers. The accuracy from the fully connected layer is typically higher than that from the other two layers. Therefore, we suggest that the last fully connected layer is the best layer from which to extract features for the domain adaptation problem.
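For reference, the three candidate outputs in Tab. II can be obtained as follows, reusing the `base` model from the earlier sketch; the layer names again follow the Keras implementation and are assumptions, since the pre-trained network the paper uses may name them differently.

```python
# Last average pooling layer: 1536-d features.
pool_extractor = tf.keras.Model(base.input, base.get_layer("avg_pool").output)
# Last fully connected layer: 1000-d features (base was built with classifier_activation=None).
fc_extractor = tf.keras.Model(base.input, base.get_layer("predictions").output)
# Classification output: 1000-d class probabilities, i.e. softmax of the fully connected layer.
cls_extractor = tf.keras.Model(
    base.input, tf.keras.layers.Softmax()(base.get_layer("predictions").output))
```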
We then examine the quality of our IR features from the last fully connected layer. We visualize three domains from the three datasets using the t-SNE technique. t-SNE (t-distributed Stochastic Neighbor Embedding) [42]
is an algorithm for visualizing high-dimensional data by re-representing it in a lower dimensional space. t-SNE generates a low-dimensional representation in which points near each other are similar in the high-dimensional space and vice versa. The better that clusters are separated in the t-SNE view, the better the extracted features are likely to be. The loss function of the t-SNE method is Kullback-Leibler divergence, which measures the difference between the two distributions
[42]. Typically, lower losses correspond to better features. As shown in Fig. 8, our extracted IR features produce clearer and better-separated clusters than the SURF, DeCAF, and ResNet-50 features. Therefore, we can expect our IR features to lead to better classification results than the others. Consistent with this, the visualizations based on our IR features have the lowest loss values.
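A minimal sketch of this check with scikit-learn is shown below; `X` and `y` stand for any one of the feature sets (SURF, DeCAF, ResNet-50, or IR) and its integer class labels, and the reported KL divergence is the loss value referred to above.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_view(X, y, title):
    """Embed features in 2-D with t-SNE, plot them, and report the KL-divergence loss."""
    tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
    emb = tsne.fit_transform(X)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
    plt.title(f"{title} (KL divergence = {tsne.kl_divergence_:.2f})")
    plt.show()
    return tsne.kl_divergence_
```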
We compare the performance of our MDAIR model with 25 state-of-the-art (both traditional and deep learning) methods: Transfer Component Analysis (TCA) [8]; Global and Local Metrics for Domain Adaptation (IGLDA, also called ITCA) [19]; Semi-supervised TCA (SSTCA) [8]; Transfer Joint Matching (TJM) [43]; Balanced Distribution Adaptation (BDA) [44]; Joint Distribution Adaptation (JDA) [6]; Support Vector Machine (SVM) [10]; Geodesic Flow Kernel (GFK) [5]; Adaptation Regularization (ARTL) [45]; Joint Geometrical and Statistical Alignment (JGSA) [46]; Manifold Embedded Distribution Alignment (MEDA) [15]; AlexNet [34]; VGG-16 [35]; Deep Adaptation Networks (DAN) [23]; Deep Domain Confusion (DDC) [7]; Deep Correlation Alignment (DCORAL) [16]; Joint Adaptation Networks (JAN) [18]; Residual Transfer Networks (RTN) [47]; Domain Adaptive Neural Networks (DANN) [48]; Domain Adaptive Hashing (DAH) [49]; Minimum Discrepancy Deep Adaptation (MDDA) [41]; Adversarial Discriminative Domain Adaptation (ADDA) [17]; Collaborative Adversarial Network (CAN) [21]; Joint Discriminative Domain Adaptation (JDDA) [20]; and Conditional Domain Adversarial Networks (CDAN-RM, CDAN-M) [50]. From Tables III, IV, and V, we can observe that the accuracy of the MDAIR model is ahead of all other methods in most tasks (23 of 30). Notably, our model always achieves the best performance on the Office-Home dataset. Across all three datasets, the overall average performance is significantly improved over the best state-of-the-art baseline methods. The results using SURF features are too low to be comparable with the DeCAF and IR features and are omitted.
To illustrate the effectiveness of our model, we consider the case in which all models use our IR features, and view the prediction results using t-SNE. Focusing on the A→D task, in which the accuracy of our MDAIR is 100% (and thus identical to the ground truth), Fig. 9 shows that all other conventional methods contain mixed colors in the t-SNE view. These results indicate that our modified distribution alignment is better than several baseline methods even when using the same features. In addition, we test our IR features with the original MEDA method (MEDA-IR in Tables III, IV, and V); the results still show that our modified distribution alignment is better than the original MEDA model.
In our experiments, the optimal parameters for different tasks might differ. To make our results easier to reproduce, we use the same parameter settings for all tasks.
Dataset | Best baseline | MDAIR | Improvement
---|---|---|---|
Office+Caltech-10 | 91.9 | 96.7 | 4.8% |
Office-31 | 84.3 | 89.8 | 5.5% |
Office-Home | 62.8 | 72.8 | 10% |
We list the improvement of our model over the best state-of-the-art methods in Table VI. For the three datasets (Office+Caltech-10, Office-31, and Office-Home), our method improves the absolute accuracy by 4.8%, 5.5%, and 10%, respectively. Therefore, the performance of our model exceeds that of all the state-of-the-art methods.
There are two prominent reasons for the success of our model. First, our model takes advantage of deep features from the Inception-ResNet-v2 model, which are better than the SURF and DeCAF features; better features reduce the difference between the source and target domains. Second, the modified distribution alignment facilitates the alignment of the feature distributions, which leads to higher accuracy.
In addition, our experiments imply that the last fully connected layer is the best layer for feature extraction. A likely reason is that this layer collects all the features from the previous layers; hence it forms better features than the earlier layers. Although the last classification layer of the IR model can also be used for feature extraction, its performance is worse than that of the last fully connected layer, since features from the classification layer are biased toward the classes the network was originally trained on. We observe that our model is compromised in some tasks (A→W in the Office+Caltech-10 dataset and D→A in the Office-31 dataset). This is caused by the intrinsic differences among the datasets, so we cannot guarantee that our model always beats all other methods.
However, one weakness of our model is that the choice of feature extractor significantly affects the results. We expect that extracting features from deep neural networks with higher top-1 accuracy would further improve the accuracy.
In this paper, we are the first to extract features from a pre-trained Inception-ResNet-v2 model for the domain adaptation problem. Our experiments show that the last fully connected layer is the best layer from which to extract features, and that the extracted features are better than the DeCAF and ResNet-50 features. The modified distribution alignment model performs better than the other models. We test our model using three benchmark datasets, and extensive experiments demonstrate significant improvements in classification accuracy over the state-of-the-art.
There are some obvious areas for follow-up work. Extracting features from another well-trained deep neural network might generate better input for the modified distribution alignment than the IR model. Testing on a broader set of unsupervised learning tasks would improve the applicability of our model. Also, a new distribution alignment method could be beneficial for increasing predictive accuracy.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2018.
International Conference on Artificial Intelligence and Statistics, pages 129–136, 2010.
Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2208–2217. JMLR.org, 2017.
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.