Cross-Domain Latent Modulation for Variational Transfer Learning

12/21/2020 ∙ by Jinyong Hou, et al. ∙ University of Otago

We propose a cross-domain latent modulation mechanism within a variational autoencoder (VAE) framework to enable improved transfer learning. Our key idea is to procure deep representations from one data domain and use them as perturbations to the reparameterization of the latent variable in another domain. Specifically, deep representations of the source and target domains are first extracted by a unified inference model and aligned by employing gradient reversal. Second, the learned deep representations are cross-modulated to the latent encoding of the alternate domain. The consistency between the reconstruction from the modulated latent encoding and the generation using deep representation samples is then enforced in order to produce inter-class alignment in the latent space. We apply the proposed model to a number of transfer learning tasks including unsupervised domain adaptation and image-to-image translation. Experimental results show that our model gives competitive performance.




1 Introduction

In machine learning, one can rarely apply a pre-trained model directly to a new dataset or a new task, as the performance of a learned model often drops significantly on new data that may have significant sampling bias or even belong to a different distribution. Transfer learning can help us utilize the learned knowledge from a previous domain (the ‘source’) to improve performance on a related domain or task (the ‘target’) [31, 35, 41].

From the perspective of probabilistic modeling [23, 39, 34], the key challenge in achieving cross-domain transfer is to learn a joint distribution of data from different domains. Once the joint distribution is learned, it can be used to generate the marginal distribution of the individual domains [16, 23]. Under the variational inference scenario, an inferred joint distribution is often applied to the latent space. Due to the coupling theory, inferring the joint distribution from the marginal distributions of different domains is a highly ill-posed problem [21]. To address this problem, UNIT [23] makes an assumption that there is a shared latent space for the two domains. Usually, this can be achieved by applying the adversarial strategy to the domains’ latent spaces. Another line of research focuses on the use of a complex prior to improve the representation performance for the input data [28, 36, 12]. However, the previous works neglect the role of the generation process for the latent space which could be helpful for cross-domain transfer scenarios.

In this paper, we propose a novel latent space reparameterization method, and employ a generative process to cater for the cross-domain transferability. Specifically, we incorporate a cross-domain component into the reparameterization transformation, which builds the connection between the variational representations and domain features in a cross-domain manner. The generated transfer latent space is further tuned by domain-level adversarial alignment and domain consistency between images obtained through reconstruction and generations. We apply our model to the homogeneous transfer scenarios, such as unsupervised domain adaptation and image-to-image translation. The experimental results show the efficiency of our model.

The rest of the paper is organized as follows. In Section 2, some related work is briefly reviewed. In Section 3, we outline the overall structure of our proposed model and develop the learning metrics with defined losses. The experiments are presented and discussed in Section 4. We conclude our work in Section 5, indicating our plan of future work.

2 Related Work

Latent space manipulation: As discussed above, manipulation of the latent space is a common way to model a joint distribution in cross-domain adaptation [17, 23, 22]. One approach focuses on a shared latent space, where the latent encodings are regarded as common representations for inputs across domains. An adversarial strategy is usually used to pool them together so that the representations are less domain-dependent. For the variational approach, the works in [28, 48, 14] adopt complex priors for multi-modal latent representations, while other works [23, 34, 22] still assume a standard Gaussian prior. Another approach is to use disentangled latent representations, where the latent encoding is divided into defined parts (e.g. style and content parts); the model then learns separate representations and swaps them for the transfer [9, 46, 6, 20]. Our method is different from these approaches. In our model, a learned auxiliary deep representation is used to generate perturbations to the latent space through a modified reparameterization using variational information from the counterpart domain. This helps generate cross-domain image translations. The transfer is carried out by a reparameterization transformation, using statistical moments that retain specific information for one domain and a deep representation that provides information from another domain.

Varied homogeneous transfer tasks: Manipulation of the latent space is often interwoven with homogeneous image transfer tasks, such as unsupervised domain adaptation and image translation [29, 30, 5]. In the domain separation networks [4], separate encoding modules are employed to extract the domain-invariant representation and the domain-specific representations from the domains respectively, with the domain-invariant representations being used for the domain adaptation. References [3, 33, 11] transfer the target images into source-like images for domain adaptation. References [23, 47, 13] map the inputs of different domains to a single shared latent space, but cycle consistency is required for the completeness of the latent space. The UFDN [22] utilizes a unified encoder to map the multi-domain images to a shared latent space, and the latent domain-invariant coding is manipulated for image translation between different domains.

In contrast, we adopt the pixel-level adaptation between domains from the cross-domain generation, but the proposed model can also be used at the feature-level due to the latent space alignment. Our model also has a unified inference model, but the consistency is imposed in a straightforward way, with reduced computational complexity.

3 Proposed Model

3.1 Problem setting

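Let $\mathcal{X} \subset \mathbb{R}^d$ be a $d$-dimensional data space, and $X = \{x_1, \ldots, x_n\} \subset \mathcal{X}$ the sample set with marginal distribution $p(X)$. The source domain is denoted by a tuple $(\mathcal{X}_s, p(X_s))$, and the target domain by $(\mathcal{X}_t, p(X_t))$. In our paper, we consider the homogeneous transfer with domain shift, i.e. $\mathcal{X}_s = \mathcal{X}_t$, but $p(X_s) \neq p(X_t)$. For the unsupervised pixel-level domain adaptation scenario, the label set is $Y = \{y_1, \ldots, y_n\} \subset \mathcal{Y}$ ($\mathcal{Y}$ is the label space), and a task $\mathcal{T} = p(Y \mid X)$ is considered too. However, only the source domain’s label set $Y_s$ is available during transfer learning.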

3.2 Transfer Latent Space

As any given marginal distribution can be yielded by an infinite number of joint distributions, we need to build an inference framework with some constraints. Under the variational autoencoder (VAE) framework, the latent space is one of the manipulation targets. We propose the transfer latent space as follows.

Transfer Latent Space $\mathcal{Z}$. Let $x_s \in X_s$, $x_t \in X_t$ be the domain samples. Let us have a map $f$ that extracts domain information $\Omega$ and a feature representation $h$ given an input $x$:

$$[\Omega, h] = f(x). \tag{1}$$

Suppose we construct a transfer map $\zeta$ that generates a latent variable from $\Omega$ and $h$ with domain crossovers:

$$z_{st} = \zeta(\Omega_s, h_t), \qquad z_{ts} = \zeta(\Omega_t, h_s). \tag{2}$$

The joint space formed by $z_{st}$ and $z_{ts}$ samples is defined as a transfer latent space, denoted by $\mathcal{Z}$.

The transfer latent space is intended to become a “mixer” for the two domains, as the resulting latent variables are under cross-domain influences. Hence the transfer latent space can be regarded as a generalization of the latent space.

3.3 Framework

Our framework is shown in Fig. 1. In our framework, we build the cross-domain generation by a unified inference model $E_\phi(\cdot)$ (as an implementation of the map $f$) and a generative model $D_\theta(\cdot)$ for the desired domain, e.g., the source domain in our model. A discriminator is utilized for the adversarial training. We use the terms “inference model” and “encoder” for $E_\phi$, and “generative model” and “decoder” for $D_\theta$, interchangeably.

Figure 1: Architectural view of the proposed model. It encourages an image from the target domain (blue hexagon) to be transformed to a corresponding image in the source domain (black hexagon). The transfer latent distributions over $z_{st}$ and $z_{ts}$ are learned and used to generate the corresponding images with the desired decoder. The deep representations are integrated into the reparameterization transformation with standard Gaussian auxiliary noise. Blue lines are for the target domain and black ones are for the source domain.

As discussed in Section 3.2, under the variational framework, the domain information $\Omega$ (here we remove the domain subscript for simplicity) is usually the pair $(\mu, \sigma)$. Let $h'$ be the flattened activations of the last convolution layer in $E_\phi$. Then, following the treatment in [15], $\mu$ and $\sigma$ can be obtained by $\mu = W_\mu h' + b_\mu$ and $\log \sigma^2 = W_\sigma h' + b_\sigma$, where $W_\mu, b_\mu, W_\sigma, b_\sigma$ are the weights and biases for $\mu$ and $\sigma$.

From our observations, both shallow (e.g. PCA features) and deep representations can be used to obtain the domain information; in our end-to-end model we use the latter. We choose the high-level activation of the last convolutional layer as the deep representation $h$ [44], computed with its own weights and biases for the deep abstractions.

Having obtained the domain information $\Omega = (\mu, \sigma)$ and the deep representation $h$, a natural choice for the transfer map is through reparameterization. Here we propose a modified reparameterization trick to sample from the transfer latent space as follows:

$$z_{st} = \mu_s + \sigma_s \odot (\gamma_1 h_t + \gamma_2 \epsilon), \qquad z_{ts} = \mu_t + \sigma_t \odot (\gamma_1 h_s + \gamma_2 \epsilon), \tag{3}$$

where $h_s$ ($h_t$) is the sample of the deep representation space of the source (target); $\mu_s$ and $\sigma_s$ ($\mu_t$ and $\sigma_t$) are the mean and standard deviation of the approximate posterior for the source (target) domain; $\gamma_1$ and $\gamma_2$ are trade-off hyperparameters that balance the deep feature modulation and the standard Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$; and $\odot$ stands for the element-wise product of vectors. Therefore, the auxiliary noise in VAE resampling is now a weighted sum of a deep representation from the other domain and Gaussian noise, different from the standard VAE framework. Because the modified reparameterization allows a domain’s deep representation to get modulated into another domain’s latent representation, we call our model “Cross-Domain Latent Modulation”, or CDLM for short.
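The modulated sampling described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `modulated_reparam`, the array shapes, and the toy inputs are hypothetical, and the default weights (1.0, 0.1) are simply one of the settings evaluated in Table 4, not a confirmed default of the authors' implementation.

```python
import numpy as np

def modulated_reparam(mu, sigma, h_other, gamma1=1.0, gamma2=0.1, rng=None):
    """Sample z = mu + sigma * (gamma1 * h_other + gamma2 * eps), eps ~ N(0, I).

    mu, sigma : posterior mean / std of one domain, shape (batch, dim)
    h_other   : deep representation of the *other* domain, same shape
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(mu.shape)            # standard Gaussian auxiliary noise
    return mu + sigma * (gamma1 * h_other + gamma2 * eps)

# Toy inputs standing in for encoder outputs (hypothetical values).
rng = np.random.default_rng(0)
batch, dim = 4, 8
mu_s, mu_t = rng.normal(size=(batch, dim)), rng.normal(size=(batch, dim))
sigma_s, sigma_t = np.exp(rng.normal(size=(batch, dim))), np.exp(rng.normal(size=(batch, dim)))
h_s, h_t = rng.uniform(size=(batch, dim)), rng.uniform(size=(batch, dim))

z_st = modulated_reparam(mu_s, sigma_s, h_t, rng=rng)   # source moments, target feature
z_ts = modulated_reparam(mu_t, sigma_t, h_s, rng=rng)   # target moments, source feature

# With gamma2 = 0 the sampling becomes deterministic.
z_det = modulated_reparam(mu_s, sigma_s, h_t, gamma1=1.0, gamma2=0.0, rng=rng)
```

Setting `gamma1=0` and `gamma2=1` recovers the standard VAE reparameterization, which makes the cross-domain term easy to switch off for comparison.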

Now we have obtained the transfer encodings by a unified encoder. Following the probabilistic encoder analysis [15], a shared inference model confines the latent variables into the same latent space, but this cannot guarantee that they are aligned. To pull the domains close, an adversarial strategy [7, 37] is needed for the alignment. The gradient reversal layer [7] is used in our model, by which adversarial learning is introduced to learn transferable features that are robust to domain shift. The adversarial alignment between the deep representations of the source and target, $h_s$ and $h_t$, is at the domain level.
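As a reminder of how a gradient reversal layer behaves, the following minimal sketch implements its forward and backward rules by hand in NumPy. It is not the authors' TensorFlow code: the class name `GradientReversal` and the scaling factor `lam` are illustrative assumptions, and a real model would register the same rule with the framework's automatic differentiation.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical reversal strength

    def forward(self, x):
        # Features pass through unchanged on their way to the discriminator.
        return x

    def backward(self, grad_output):
        # The gradient flowing back to the encoder has its sign flipped, so the
        # encoder is pushed to make the domains harder to tell apart.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
features = np.array([[1.0, -2.0], [0.5, 3.0]])
out = grl.forward(features)
upstream = np.ones_like(features)          # gradient coming from the discriminator loss
grad_to_encoder = grl.backward(upstream)
```

The discriminator itself is trained with the unreversed gradient, which is what makes a single backward pass act adversarially on the encoder.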

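Furthermore, for better interpretation of the modulated reparameterization, let the target deep representation follow $h_t \sim \mathcal{N}(\mu_{h_t}, \sigma_{h_t}^2)$, and let $\gamma_1$ and $\gamma_2$ be the weights on the deep representation and on the Gaussian noise. Then for $z_{st}$, we have $z_{st} \sim \mathcal{N}(\mu_{st}, \sigma_{st}^2)$. For the $i$-th element of $z_{st}$, the distribution moments are given as follows:

$$\mu_{st}^{i} = \mu_s^{i} + \gamma_1 \sigma_s^{i} \mu_{h_t}^{i}, \qquad (\sigma_{st}^{i})^2 = (\sigma_s^{i})^2 \left( \gamma_1^2 (\sigma_{h_t}^{i})^2 + \gamma_2^2 \right). \tag{4}$$

Therefore, $\mu_{st}$ and $\sigma_{st}$ are

$$\mu_{st} = \mu_s + \gamma_1 \sigma_s \odot \mu_{h_t}, \qquad \sigma_{st}^2 = \sigma_s^2 \odot \left( \gamma_1^2 \sigma_{h_t}^2 + \gamma_2^2 \right). \tag{5}$$

Here, it is reasonable to assume $\mu_{h_t} \approx \mu_t$ and $\sigma_{h_t} \approx \sigma_t$ when the training is finished. With a practical setting of $\gamma_1 \gg \gamma_2$, the $\gamma_2^2$ term is in effect negligible, and Eq. (5) can be further simplified to $\mu_{st} \approx \mu_s + \gamma_1 \sigma_s \odot \mu_t$ and $\sigma_{st} \approx \gamma_1 \sigma_s \odot \sigma_t$. Then we can see that $\mu_{st}$ can be regarded as a location shift of $\mu_s$ under the influence of $\mu_t$, which helps reduce the domain gap; $\sigma_{st}$ can be taken as a recoloring of $\sigma_s$ under the influence from the target. The formulation of $z_{ts}$ can be similarly interpreted. These modulated encodings are hence constructed in a cross-domain manner.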
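The moment interpretation can be checked numerically for a single latent element. The sketch below assumes, for illustration only, that the deep representation is Gaussian and independent of the auxiliary noise, and that the source moments are held fixed; all numeric values and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical scalar moments for one latent element.
mu_s, sigma_s = 0.3, 0.8      # source posterior mean / std
mu_h, sigma_h = 0.5, 0.2      # mean / std of the target deep representation
gamma1, gamma2 = 1.0, 0.1     # modulation weight and noise weight

h_t = rng.normal(mu_h, sigma_h, size=n)
eps = rng.standard_normal(n)
z_st = mu_s + sigma_s * (gamma1 * h_t + gamma2 * eps)

# Closed-form moments of the modulated encoding under the assumptions above.
mean_closed = mu_s + gamma1 * sigma_s * mu_h
var_closed = sigma_s**2 * (gamma1**2 * sigma_h**2 + gamma2**2)

print(z_st.mean(), mean_closed)   # empirical vs. closed-form mean
print(z_st.var(), var_closed)     # empirical vs. closed-form variance
```

Here the mean is shifted away from `mu_s` by the target feature, while the variance is the source variance rescaled by the target spread plus a small noise term, which matches the location-shift and recoloring reading.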

Next, we apply the consistency constraint to the transfer latent space with modulation for further inter-class alignment. It has been found that consistency constraints preserve class-discriminative information [34, 13, 8]. For our model, the consistency is applied to the reconstructions from modulated encodings and the corresponding generations from deep representations. Let $D_\theta(\cdot)$ be the generative model for domain image generation from the transfer latent space. The consistency requirements are that each reconstruction decoded from a modulated encoding ($z_{st}$ or $z_{ts}$) agrees with the corresponding generation decoded from a deep representation ($h_s$ or $h_t$), with one such pair for the source and one for the target. $D_\theta$ can thus also function as a generative model, generating images from $h_s$ and $h_t$ for the source and target domain respectively. Also, the consistencies can guide the encoder to learn the representations from both domains.

Finally, a desired marginalized decoder, e.g. the source decoder, is trained to map the target images to be source-like. In test mode, we render the target’s structured generation, so the source does not need to be taken into account. That is, a test image from the target domain first passes through the inference model to obtain its deep feature $h_t$. This is then fed into the generation model to generate an image with the source style that keeps its own class. In other words, the generated image should follow the marginal distribution of the source while preserving the class of the target input.

3.4 Learning

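Our goal is to update the variational parameters $\phi$ to learn a joint distribution and the generation parameters $\theta$ for the desired marginal distribution. Since the latent variables are generated with inputs from both domains, we have a modified formulation adapted from the plain VAE:

$$\log p_\theta(x_s) = D_{KL}\big(q_\phi(z \mid x_s, x_t) \,\|\, p_\theta(z \mid x_s)\big) + \mathcal{L}(\theta, \phi; x_s, x_t),$$

where $D_{KL}$ is the Kullback-Leibler divergence, and the transfer latent variable $z$ can be either $z_{st}$ or $z_{ts}$. Minimizing $D_{KL}$ is equivalent to maximizing the variational evidence lower bound (ELBO) $\mathcal{L}(\theta, \phi; x_s, x_t)$:

$$\mathcal{L}(\theta, \phi; x_s, x_t) = \mathbb{E}_{q_\phi(z \mid x_s, x_t)}\big[\log p_\theta(x_s \mid z)\big] - D_{KL}\big(q_\phi(z \mid x_s, x_t) \,\|\, p(z)\big),$$

where the first term corresponds to the reconstruction cost ($\mathcal{L}_{rec}$), and the second term is the K-L divergence between the learned latent probability and the prior (specified as $\mathcal{N}(0, I)$) ($\mathcal{L}_{KL}$). Considering the reconstruction of $x_s$, and the K-L divergence for both $z_{st}$ and $z_{ts}$, we have

$$\mathcal{L}_{VAE} = \mathcal{L}_{rec}(x_s) + \mathcal{L}_{KL}(z_{st}) + \mathcal{L}_{KL}(z_{ts}).$$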

To align the deep representations of the source and target domains, an adversarial strategy is employed to regularize the model. The loss function is given by

$$\mathcal{L}_{adv} = \mathbb{E}_{h_s}\big[\log \Xi(h_s)\big] + \mathbb{E}_{h_t}\big[\log\big(1 - \Xi(h_t)\big)\big],$$

where $\Xi$ is the discriminator that predicts which domain a deep representation comes from.

From the analysis in Section 3.3, we can introduce a pairwise consistency between the reconstruction and the generation for the source and the target, respectively, in an unsupervised manner. The consistency regularization improves the inter-class alignment. For the consistency loss $\mathcal{L}_c$, both the $\ell_1$- and $\ell_2$-norm penalties can be used to regularize the decoder. Here we simply use MSE. Let $\mathcal{L}_c^s$ and $\mathcal{L}_c^t$ be the consistency losses for the source and target domains, respectively. $\mathcal{L}_c$ is given as a combination of these two components, weighted by two coefficients $\alpha_1$ and $\alpha_2$, respectively:

$$\mathcal{L}_c = \alpha_1 \mathcal{L}_c^s + \alpha_2 \mathcal{L}_c^t.$$

Then, the variational parameters $\phi$ and the generation parameters $\theta$ are updated by gradient steps on the combined objective, each with its own learning rate. Note that only data from the desired domain (the source) are used to train the reconstruction loss. The K-L terms pull the transfer latent space toward the prior. Two hyperparameters are used to balance the discriminator loss and the reconstruction loss.
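The individual loss terms of this section can be sketched with NumPy as follows. This is a hedged illustration rather than the authors' implementation: the helper names (`kl_to_standard_normal`, `mse`, `consistency_loss`), the weights `w_s` and `w_t`, and the toy arrays are assumptions, and the closed-form K-L term is the standard one for a diagonal Gaussian against a standard Gaussian prior.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dims, averaged over the batch."""
    kl = 0.5 * np.sum(mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0, axis=1)
    return float(kl.mean())

def mse(a, b):
    """Mean squared error, used here for both reconstruction and consistency."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def consistency_loss(recon_s, gen_s, recon_t, gen_t, w_s=1.0, w_t=1.0):
    """Weighted sum of the source and target consistency terms (weights are assumed)."""
    return w_s * mse(recon_s, gen_s) + w_t * mse(recon_t, gen_t)

# Toy check values (hypothetical).
kl_zero = kl_to_standard_normal(np.zeros((2, 3)), np.ones((2, 3)))   # prior itself
kl_shift = kl_to_standard_normal(np.ones((1, 2)), np.ones((1, 2)))   # unit mean shift

recon_s, gen_s = np.zeros((2, 2)), np.ones((2, 2))    # disagree everywhere
recon_t = gen_t = np.full((2, 2), 0.3)                # agree exactly
l_c = consistency_loss(recon_s, gen_s, recon_t, gen_t, w_s=0.5, w_t=1.0)
```

How these terms are weighted against the adversarial loss, and which parameters each term updates, follows the update rules of the paper and is not fixed by this sketch.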

4 Experiments

| Method | MNIST→USPS | USPS→MNIST | MNIST→MNISTM | MNISTM→MNIST | Fashion→Fashion-M | Fashion-M→Fashion | Linemod 3D→Linemod Real |
|---|---|---|---|---|---|---|---|
| Source Only | 0.634 | 0.625 | 0.561 | 0.633 | 0.527 | 0.612 | 0.632 |
| DANN [7] | 0.774 | 0.833 | 0.766 | 0.851 | 0.765 | 0.822 | 0.832 |
| CyCADA [11] | 0.956 | 0.965 | 0.921 | 0.943 | 0.874 | 0.915 | 0.960 |
| GtA [33] | 0.953 | 0.908 | 0.917 | 0.932 | 0.855 | 0.893 | 0.930 |
| CDAN [27] | 0.956 | 0.980 | 0.862 | 0.902 | 0.875 | 0.891 | 0.936 |
| PixelDA [3] | 0.959 | 0.942 | 0.982 | 0.922 | 0.805 | 0.762 | **0.998** |
| UNIT [23] | 0.960 | 0.951 | 0.920 | 0.932 | 0.796 | 0.805 | 0.964 |
| CDLM | **0.961** | **0.983** | **0.987** | **0.962** | **0.913** | **0.922** | 0.984 |
| Target Only | 0.980 | 0.985 | 0.983 | 0.985 | 0.920 | 0.942 | 0.998 |

Table 1: Mean classification accuracy comparison (columns are source→target tasks). The “Source Only” row gives the accuracy on the target when training only on the source, without domain adaptation. The “Target Only” row gives the accuracy of full training on the target. For each source-target task the best performance is in bold.

We conducted extensive evaluations of CDLM in two homogeneous transfer scenarios: unsupervised domain adaptation and image-to-image translation. Our model was implemented using TensorFlow [1]. The structures of the encoder and the decoder follow those of UNIT [23], which perform well for image translation tasks. A two-layer fully connected MLP was used for the discriminator. SGD with momentum was used for updating the variational parameters, and Adam for updating the generation parameters. The batch size was set to 64. During the experiments, we set , , and . For the datasets, we considered a few popular benchmarks, including MNIST [19], MNISTM [7], USPS [18], Fashion-MNIST [43], Linemod [10, 42], Zap50K-shoes [45] and CelebA [25, 24].

4.1 Datasets

We have evaluated our model on a variety of benchmark datasets. They are described as follows.

MNIST: The MNIST handwritten digit dataset [19] is a very popular machine learning dataset. It has a training set of 60,000 binary images and a test set of 10,000. There are 10 classes in the dataset. In our experiments, we use the standard split of the dataset. MNISTM [7] is a modified version of MNIST, with random RGB backgrounds cropped from the Berkeley Segmentation Dataset.

USPS: USPS is a dataset of handwritten zip code digits [18]. It contains 9298 binary images (16×16), 7291 of which are used as the training set, while the remaining 2007 are used as the test set. The USPS samples are resized to 28×28, the same as MNIST.

Fashion: Fashion [43] contains 60,000 images for training, and 10,000 for testing. All the images are grayscale, 28×28 in size. In addition, following the protocol in [7], we add random noise to the Fashion images to generate the FashionM dataset, with random RGB backgrounds cropped from the Berkeley Segmentation Dataset.

Linemod 3D images: Following the protocol of [3], we render the LineMod dataset [10, 42] for the adaptation between synthetic 3D images (source) and real images (target). The objects, in different poses, are located at the center of the images. The synthetic 3D images are rendered on a black background, while the real images show a variety of complex indoor environments. We use the RGB images only, not the depth images.

CelebA: CelebA [25] is a large celebrity face image dataset. It contains more than 200K images annotated with 40 facial attributes. We select 50K images randomly, then transform them to sketch images following the protocol of [24]. The original and sketched images are used for translation.

UT-Zap50K-shoes: This dataset [45] contains 50K shoe images in 4 different classes. For the translation, we use the edges produced by the Canny detector.

4.2 Unsupervised Domain Adaptation

We applied our model to unsupervised domain adaptation, adapting a classifier trained using labelled samples in the source domain to classify samples in the target domain. For this scenario, only the labels of the source images were available during training. We chose DANN [7] as the baseline, and also compared our model with state-of-the-art domain adaptation methods: Conditional Domain Adaptation Network (CDAN) [27], Pixel-level Domain Adaptation (PixelDA) [3], Unsupervised Image-to-Image Translation (UNIT) [23], Cycle-Consistent Adversarial Domain Adaptation (CyCADA) [11], and Generate to Adapt (GtA) [33]. We also used source-only and target-only training as the lower and upper bound respectively, following the practice in [3, 7].

4.2.1 Quantitative Results

The performance of domain adaptation for the different tasks is shown in Table 1. There are 4 scenarios and 7 tasks. Each scenario has bidirectional adaptation tasks except LineMod, which is adapted only from synthetic 3D images to real objects. Where the same adaptation task is reported in the corresponding references, we cite that accuracy; otherwise, the accuracies are obtained by training the open-source code provided by the authors with the suggested optimal parameters, for fair comparison.

From Table 1 we can see that our method has a clear advantage over the baseline and the source-only accuracy, and is a little lower than the target-only accuracy in both adaptation directions. In comparison with other models, our model performs better on most tasks. CDLM achieves a higher adaptation accuracy for the scenarios with a seemingly larger domain gap, such as MNIST→MNISTM and Fashion→FashionM. For the 3D scenario, the performance of our model is a little lower than that of PixelDA [3], but it outperforms all the other compared methods. In PixelDA, the input includes not only the source image but also its paired depth image, which might be helpful for the generation. In addition, we use t-SNE [38] to visualize the latent encodings ($z_{st}$, $z_{ts}$) w.r.t. the source and the target, respectively. Fig. 4 shows the visualization for the tasks MNISTM→MNIST and MNIST→USPS; in both cases the encodings are well aligned.

4.2.2 Qualitative Results

Our model can also visualize the adaptation. Fig. 2 shows the visualizations for the digit and Fashion adaptation tasks. For the scenario of MNIST and USPS, the generation for the task USPS→MNIST is shown in Fig. 2(a). The target MNIST is transferred to the source USPS style well, while keeping the corresponding content (label). For example, the digit ‘1’ in MNIST becomes more slanted and ‘9’ becomes flatter. Likewise, in Fig. 2(b), the target USPS takes on the MNIST style. For the scenario of MNIST and MNISTM, our proposed model can remove and add the noisy background well for adaptation.

Figure 2: Visualization for the adaptations. 6 different tasks are illustrated, including (e) Fashion→FashionM and (f) FashionM→Fashion. For each task, the first row shows target images and the second row shows the adapted images with source-like style.

Figure 3: Linemod 3D Synthetic→Real. For a query image (on the left), different adaptation images (to the right) with various poses can be generated.

Figure 4: t-SNE visualization of cross-domain latent encodings. The encodings of the two transfer directions are shown in blue and red, respectively.

For the scenario of Fashion, the fashion items have more complicated texture and content. In addition, the noisy backgrounds pollute the items randomly; for example, different parts of a garment are filled with various colors. Specifically, Fig. 2(e) is for the task Fashion→FashionM. The proposed model can remove the noisy background and maintain the content. On the other hand, Fig. 2(f) shows that the original Fashion images are given a noisy background similar to that of the source. This is promising for a better adaptation performance.

For Linemod3D, the real object images with different backgrounds are transferred to synthetic 3D images with a black background. Due to the 3D style, the generation of the target gives different poses. For example, in Fig. 3, different poses of the iron object are obtained in different trials.

Figure 5: Visualization of cross-domain image mapping. Panels: (a) “edge” to “shoes”; (b) “sketch” to “face”; (c) “shoes” to “edge”; (d) “face” to “sketch”.

4.3 Cross-Domain Image Mapping

The proposed model can also be used for cross-domain image mapping. Fig. 5 gives a demonstration of the image style translation. Specifically, for “shoes” and “edges” in Fig. 5(a) and 5(c), we can see that the proposed model can translate “edges” to its counterpart quite well. The translation is stochastic: an “edge” pattern can be used to generate “shoes” in different colors in different trials. For the more challenging “face” and “sketch” translations, the proposed model also performs well. The generations have some variations compared with the original images. In general, our method can generate realistic translated images. However, we find that compared with the translation from sketches to real images, the reverse task seems harder. For example, when a face image is given, the generated sketch loses some details. The reason may be that low-level features are neglected when the deep feature acts as the condition.

For further evaluation, we also assess the image mapping quantitatively, using SSIM [40], MSE, and PSNR. The results are shown in Table 2. We can see that our model outperforms E-CDRD [24], which learns a disentangled latent encoding for the source and the target for domain adaptation. Meanwhile, it matches the performance of StarGAN [5], which is designed for multi-domain image translation. The results show that our model can map cross-domain images well compared to these prior works.
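Two of the metrics above are simple enough to state directly in code. The sketch below assumes images scaled to [0, 1] and uses hypothetical helper names; it is not the evaluation script behind Table 2, and SSIM is omitted because it relies on windowed local statistics that are better taken from an image-processing library.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; max_val is the largest possible pixel value."""
    err = mse(a, b)
    if err == 0.0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(max_val**2 / err))

# Toy example: a constant offset of 0.1 on a [0, 1] image.
reference = np.zeros((4, 4))
translated = np.full((4, 4), 0.1)
example_mse = mse(reference, translated)
example_psnr = psnr(reference, translated)
```

Note that averaging per-image PSNR over a test set generally differs from computing PSNR from the averaged MSE, so the two columns of a results table need not convert into each other exactly.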

| Model | SSIM | MSE | PSNR |
|---|---|---|---|
| E-CDRD [24] | 0.6229 | 0.0207 | 16.86 |
| StarGAN [5] | 0.8026 | 0.0142 | 19.04 |
| CDLM | 0.7961 | 0.0140 | 19.89 |

Table 2: Performance for image mapping (“Sketch” to “Face”).

In addition, we also use classification to evaluate the translation performance. We take the shoes as an example, which are labeled with 4 different classes. The recognition accuracy of our proposed model for the task shoes→edge is 0.953, which is higher than the results of PixelDA (0.921) and UNIT (0.916).

4.4 Model Analysis

The effect of encoder settings (depth and different ($\gamma_1$, $\gamma_2$)): In our model, the deep features are utilized to cross-modulate the transfer latent encoding. Therefore the deep feature is an important factor in our framework and is influenced by the depth of the encoder. During the experiments, we use MNIST→USPS and Fashion→FashionM as the evaluation tasks. In the first task, the domains have different content but the same background. The second task is a totally different scenario: the images have the same content but different backgrounds. The outputs of different encoder layers (Conv4, Conv5, and the last convolutional layer) are used for the experiments.

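| Task / Layer | Conv4 | Conv5 | Conv-last |
|---|---|---|---|
| MNIST→USPS | 0.954 | 0.956 | 0.961 |
| Fashion→FashionM | 0.890 | 0.905 | 0.913 |

Table 3: Adaptation accuracy with different layer depths for the tasks MNIST→USPS and Fashion→FashionM.

| Task / ($\gamma_1$, $\gamma_2$) | (0.1, 1.0) | (0.5, 0.5) | (0.9, 0.1) | (1.0, 0.1) | (1.0, 0) |
|---|---|---|---|---|---|
| MNIST→USPS | 0.320 | 0.723 | 0.961 | 0.961 | 0.961 |
| Fashion→FashionM | 0.226 | 0.513 | 0.912 | 0.913 | 0.913 |

Table 4: Adaptation accuracy with different ($\gamma_1$, $\gamma_2$), the weights on the deep representation and the Gaussian noise, for the tasks MNIST→USPS and Fashion→FashionM.

| Model / Task | MNIST→USPS | USPS→MNIST | Fashion→FashionM | FashionM→Fashion |
|---|---|---|---|---|
| CDLM w/o $\mathcal{L}_c$ | 0.635 | 0.683 | 0.646 | 0.672 |
| CDLM + $\mathcal{L}_c^t$ | 0.689 | 0.695 | 0.682 | 0.691 |
| CDLM + $\mathcal{L}_c^s$ | 0.951 | 0.980 | 0.912 | 0.915 |
| CDLM + $\mathcal{L}_c^s$ + $\mathcal{L}_c^t$ | 0.961 | 0.983 | 0.913 | 0.922 |

Table 5: Evaluation of the effect of the unsupervised consistency metrics. The recognition accuracy is shown for four tasks in the unsupervised domain adaptation scenario. Our model is in the last row, with both the source consistency $\mathcal{L}_c^s$ and the target consistency $\mathcal{L}_c^t$, and achieves the best performance.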

As the results in Table 3 show, a higher accuracy is achieved when more layers are used to extract the deep representations. The accuracy gain for the task MNIST→USPS is lower than that for Fashion→FashionM. This is expected, as features extracted by higher layers would normally eliminate lower-level variations between domains, such as changes of background and illumination in the images.

For ($\gamma_1$, $\gamma_2$), the weights on the deep representation and the Gaussian noise, we fix the last convolutional layer as the source of the deep representations and evaluate different values. From Table 4, we can see that the performance drops significantly when $\gamma_1$ is small compared with $\gamma_2$, and increases with a larger $\gamma_1$. The performance seems to stabilize once $\gamma_1$ reaches 0.9 while $\gamma_2$ remains 0.1. Following the standard VAE, we keep the noise ($\gamma_2 = 0.1$) in the evaluations. Meanwhile, our model works well even when $\gamma_2 = 0$. These results suggest that the deep representation plays a crucial role in the cross-domain modulation.

$\mathcal{A}$-distance: In a theoretical analysis of the domain discrepancy [2], Ben-David et al. suggest that the $\mathcal{A}$-distance can be used as a measure of the domain discrepancy. As the exact $\mathcal{A}$-distance is intractable, a proxy is defined as $\hat{d}_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ is the generalization error of a binary classifier (e.g. kernel SVM) trained to distinguish the input’s domain (source or target). Following the protocol of [26, 32], we calculate the $\mathcal{A}$-distance on four adaptation tasks using raw features, DANN features, and CDLM features, respectively. The results are shown in Fig. 6. We observe that both DANN and CDLM reduce the domain discrepancy compared with the raw images, and the $\mathcal{A}$-distance of CDLM is smaller than that of DANN. This demonstrates that it is harder to distinguish the source and the target from the CDLM generations.
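The proxy measure is a one-line function of the domain classifier's error. The sketch below uses the commonly cited form 2(1 - 2ε); the function name `proxy_a_distance` is hypothetical, and the error value is assumed to come from a held-out evaluation of a source-versus-target classifier.

```python
def proxy_a_distance(error):
    """Proxy A-distance 2 * (1 - 2 * error) from a domain classifier's test error."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must be a rate in [0, 1]")
    return 2.0 * (1.0 - 2.0 * error)

# A classifier at chance level (error 0.5) cannot separate the domains at all,
# while a perfect classifier (error 0.0) gives the maximum distance.
d_chance = proxy_a_distance(0.5)
d_perfect = proxy_a_distance(0.0)
d_partial = proxy_a_distance(0.25)
```

Lower values therefore indicate features on which the two domains are harder to tell apart, which is the direction of improvement reported for CDLM.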

Figure 6: $\mathcal{A}$-distance comparison for four tasks.

Convergence: We also conduct a convergence experiment, tracking the training error on the task MNIST→USPS, to evaluate our model. As shown in Fig. 7, our model converges better than DANN, though there are some oscillations at the beginning of the training. In addition, the error of CDLM is lower than that of DANN, which demonstrates that CDLM has a better adaptation performance. This is consistent with the adaptation performance in Table 1.

Figure 7: Convergence of CDLM compared with DANN.

The effect of unsupervised consistency metrics: In our model, two unsupervised consistency metrics are added to improve the generation. The adaptation accuracy is used for evaluation. Table 5 shows the results for the four different tasks. The performance without the consistency loss drops because the decoder cannot generate realistic cross-domain images. The target consistency connects the reconstruction and the generation only for the target, which improves the performance slightly. Meanwhile, the source consistency loss boosts the adaptation accuracy significantly, as it connects the two domains through the generations. Finally, the scenario with both consistency losses gives the best performance in all four tasks, as it bridges the two domains on both sides.

5 Conclusion

In this paper, we have presented a novel variational cross-domain transfer learning model with cross modulation of deep representations from different domains. A shared transfer latent space is introduced, and the reparameterization transformation is modified to enforce the connection between domains. Evaluations carried out in unsupervised domain adaptation and image translation tasks demonstrate our model’s competitive performance. Its effectiveness is also clearly shown in visual assessment of the adapted images, as well as in the alignment of the latent information as revealed by visualization using t-SNE. Overall, competitive performance has been achieved by our model despite its relative simplicity.

For future work, we intend to further improve our variational transfer learning framework and use it for heterogeneous, multi-domain transfer tasks.


  • [1] M. Abadi et al. (2016) TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 [cs.DC]. Note: 1603.04467 External Links: 1603.04467 Cited by: §4.
  • [2] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira (2007) Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 137–144. External Links: ISBN 9780262195683, ISSN 1049-5258 Cited by: §4.4.
  • [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017)

    Unsupervised pixel-level domain adaptation with generative adversarial networks

    pp. 95–104. Note: DL-DA External Links: ISBN 9781538604571 Cited by: §2, §4.1, §4.2.1, §4.2, Table 1.
  • [4] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. pp. 343–351. Note: It used the encoder-decoder to extract the features, but used the loss functions to get the invariant and specific(difference) features. Then the invariant features are used for adaptation. External Links: ISSN 1049-5258 Cited by: §2.
  • [5] Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. pp. 8789–8797. External Links: 1711.09020, ISBN 9781538664209, ISSN 1063-6919 Cited by: §2, §4.3, Table 2.
  • [6] Z. Feng, A. Zeng, X. Wang, D. Tao, C. Ke, and M. Song (2018) Dual swap disentangling. pp. 5894–5904. External Links: 1805.10583, ISSN 1049-5258 Cited by: §2.
  • [7] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research (JMLR) 17 (59), pp. 1–35. Cited by: §3.3, §4.1, §4.1, §4.2, Table 1, §4.
  • [8] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li (2016) Deep reconstruction-classification networks for unsupervised domain adaptation. pp. 597–613. Cited by: §3.3.
  • [9] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio (2018) Image-to-image translation for cross-domain disentanglement. pp. 1294–1305. Cited by: §2.
  • [10] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit (2012) Gradient response maps for real-time detection of texture-less objects. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34 (5), pp. 876–888. External Links: ISSN 0162-8828 Cited by: §4.1, §4.
  • [11] J. Hoffman, E. Tzeng, T. Park, J. Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), J. G. Dy and A. Krause (Eds.), PMLR, Vol. 80, pp. 3162–3174. External Links: ISBN 9781510867963 Cited by: §2, §4.2, Table 1.
  • [12] M. D. Hoffman and M. J. Johnson (2016) ELBO surgery: yet another way to carve up the variational evidence lower bound. In Advances in Neural Information Processing Systems Workshop (NeurIPSW), Cited by: §1.
  • [13] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning (ICML), D. Precup and Y. W. Teh (Eds.), PMLR, Vol. 70, pp. 1857–1865. Cited by: §2, §3.3.
  • [14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4743–4751. External Links: arXiv:1606.04934v2, ISSN 1049-5258 Cited by: §2.
  • [15] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. Cited by: §3.3, §3.3.
  • [16] D. P. Kingma (2013) Fast gradient-based inference with continuous latent variable models in auxiliary form. arXiv:1306.0733 [cs.LG]. External Links: 1306.0733 Cited by: §1.
  • [17] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2016) Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning (ICML), M. F. Balcan and K. Q. Weinberger (Eds.), PMLR, Vol. 48, pp. 1558–1566. External Links: 1512.09300, ISBN 9781510829008 Cited by: §2.
  • [18] Y. Le Cun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson, R. E. Howard, and W. Hubbard (1989) Handwritten digit recognition: applications of neural network chips and automatic learning. IEEE Communications Magazine 27 (11), pp. 41–46. Cited by: §4.1, §4.
  • [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1, §4.
  • [20] H. Y. Lee, H. Y. Tseng, J. B. Huang, M. Singh, and M. H. Yang (2018) Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision (ECCV), pp. 36–52. External Links: 1905.01270, ISBN 9783030012458, ISSN 1611-3349 Cited by: §2.
  • [21] T. Lindvall (2002) Lectures on the coupling method. Dover. External Links: ISBN 978-0486421452 Cited by: §1.
  • [22] A. H. Liu, Y. C. Liu, Y. Y. Yeh, and Y. C. F. Wang (2018) A unified feature disentangler for multi-domain image translation and manipulation. pp. 2590–2599. External Links: 1809.01361, ISSN 1049-5258 Cited by: §2, §2.
  • [23] M. Y. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. pp. 700–708. External Links: ISSN 1049-5258 Cited by: §1, §2, §2, §4.2, Table 1, §4.
  • [24] Y. C. Liu, Y. Y. Yeh, T. C. Fu, S. D. Wang, W. C. Chiu, and Y. C. F. Wang (2018) Detach and adapt: learning cross-domain disentangled deep representation. pp. 8867–8876. External Links: 1705.01314, ISBN 9781538664209, ISSN 1063-6919 Cited by: §4.1, §4.3, Table 2, §4.
  • [25] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision (ICCV), pp. 3730–3738. External Links: 1411.7766, ISBN 9781467383912, ISSN 1550-5499 Cited by: §4.1, §4.
  • [26] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. pp. 97–105. External Links: ISBN 9781510810587 Cited by: §4.4.
  • [27] M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. pp. 1640–1650. Cited by: §4.2, Table 1.
  • [28] S. Mahajan, I. Gurevych, and S. Roth (2020) Latent normalizing flows for many-to-many cross-domain mappings. External Links: 2020.06661 Cited by: §1, §2.
  • [29] M. Naseer, S. H. Khan, H. Khan, F. S. Khan, and F. Porikli (2019) Cross-domain transferability of adversarial perturbations. External Links: 1905.11736 Cited by: §2.
  • [30] A. Noguchi and T. Harada (2019) Image generation from small datasets via batch statistics adaptation. pp. 2750–2758. External Links: 1904.01774 Cited by: §2.
  • [31] S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE) 22 (10), pp. 1345–1359. Cited by: §1.
  • [32] X. Peng, Z. Huang, Y. Zhu, and K. Saenko (2020) Federated adversarial domain adaptation. In International Conference on Learning Representations (ICLR), Cited by: §4.4.
  • [33] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa (2018) Generate to adapt: aligning domains using generative adversarial networks. pp. 8503–8512. Cited by: §2, §4.2, Table 1.
  • [34] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero- and few-shot learning via aligned variational autoencoders. pp. 8247–8255. External Links: 1812.01784, ISBN 9781728132938, ISSN 1063-6919 Cited by: §1, §2, §3.3.
  • [35] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu (2018) A survey on deep transfer learning. pp. 270–279. External Links: Document Cited by: §1.
  • [36] J. M. Tomczak and M. Welling (2018) VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics (AISTATS), A. Storkey and F. Perez-Cruz (Eds.), PMLR, Vol. 84, pp. 1214–1223. External Links: 1705.07120 Cited by: §1.
  • [37] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7167–7176. Cited by: §3.3.
  • [38] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR) 9, pp. 2579–2605. Cited by: §4.2.1.
  • [39] L. Wang, A. G. Schwing, and S. Lazebnik (2017) Diverse and accurate image description using a variational auto-encoder with an additive Gaussian encoding space. pp. 5757–5767. External Links: 1711.07068, ISSN 1049-5258 Cited by: §1.
  • [40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §4.3.
  • [41] K. R. Weiss, T. M. Khoshgoftaar, and D. Wang (2016) A survey of transfer learning. Journal of Big Data 3 (9), pp. 1–40. Cited by: §1.
  • [42] P. Wohlhart and V. Lepetit (2015) Learning descriptors for object recognition and 3D pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3109–3118. External Links: 1502.05908, ISBN 9781467369640, ISSN 1063-6919 Cited by: §4.1, §4.
  • [43] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 [cs.LG]. External Links: 1708.07747 Cited by: §4.1, §4.
  • [44] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. pp. 3320–3328. External Links: ISSN 1049-5258 Cited by: §3.3.
  • [45] A. Yu and K. Grauman (2017) Semantic jitter: dense supervision for visual comparisons via synthetic images. pp. 5570–5579. External Links: 1612.06341, ISBN 9781538610329, ISSN 1550-5499 Cited by: §4.1, §4.
  • [46] J. Zhang, Y. Huang, Y. Li, W. Zhao, and L. Zhang (2019) Multi-attribute transfer via disentangled representation. pp. 9195–9202. External Links: ISSN 2159-5399 Cited by: §2.
  • [47] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. pp. 2242–2251. Cited by: §2.
  • [48] Z. M. Ziegler and A. M. Rush (2019) Latent normalizing flows for discrete sequences. pp. 7673–7682. External Links: 1901.10548, ISBN 9781510886988 Cited by: §2.