LRS-DAG: Low Resource Supervised Domain Adaptation with Generalization Across Domains

09/15/2019 ∙ by Rheeya Uppaal, et al. ∙ University of Massachusetts Amherst

Current state-of-the-art methods in domain adaptation follow adversarial approaches, making training a challenge. Other, non-adversarial methods learn mappings between the source and target domains to achieve reasonable performance. However, even these methods do not address a key aspect: maintaining performance on the source domain after optimizing over the target domain. Additionally, there exist very few methods for low resource supervised domain adaptation. This work proposes a method, LRS-DAG, that aims to solve these issues. By adding a set of "encoder layers" which map the target domain to the source, and which can be removed when dealing directly with source data, the model learns to perform optimally on both domains. LRS-DAG is unique in that it introduces a new algorithm for low resource domain adaptation which maintains performance over the source, along with a new metric for learning the mappings.


1 Introduction

Domain adaptation (Huang et al. (2007), Ben-David et al. (2010)) aims to generalize a model from a source domain, which has vast amounts of data, to a target domain. Data in the target domain is almost always a large pool of unlabelled or partially labelled data. Domain adaptation is typically achieved by learning a mapping between the domains.

A popular way of learning these mappings is with Generative Adversarial Networks (Goodfellow et al. (2014)), using the cycle consistency constraint from the CycleGAN (Zhu et al. (2017)). This has shown promising results (Hoffman et al. (2017), Liu et al. (2017)); however, adversarial models are notoriously hard to train jointly (Arjovsky and Bottou (2017)).

There has been a series of non-adversarial approaches to learning domain mappings (Hoshen and Wolf (2018), Long et al. (2015), Sun et al. (2016), Sun and Saenko (2016), Haeusser et al. (2017)). However, all the aforementioned methods focus on the problem of large amounts of unlabelled data in the target domain. There exist many problems where collecting data at a large scale is hard (Motiian et al. (2017a), Patel et al. (2015)). There is limited work in this setting (Motiian et al. (2017a), Motiian et al. (2017b), Hosseini-Asl et al. (2018)); the typical approach is to use low capacity models to learn from the low resource data.

Additionally, there is almost no focus on maintaining performance on the source domain while improving performance on the target domain. This may be crucial in tasks where a unified model over both domains must be used, requiring a paradigm similar to multi-task learning (Jiang (2008)). For example, in the task of stellar classification, teaching a model to detect rare supernovae should not degrade its performance on detecting regular stars.

The method proposed in this work aims to address all of these problems: (1) identifying a method for supervised domain adaptation with limited labelled data, and (2) creating a model that maintains performance on the source domain even after training on the target domain. In addition, the method trains in a non-adversarial manner, which is an added advantage. The proposed method divides the network into two sets of layers, and a set of 'encoder' layers is inserted between them. The encoder layers are trained to map the target distribution to the source (rather than mapping both into a domain invariant space, as with other methods), without changing the weights of the original network. Thus, simply removing the encoder layers restores the original optimal performance of the model on the source domain. The encoder layers are trained by minimizing a measure of distance between the two distributions: the Kullback-Leibler divergence and second order statistics have been considered as objective functions. The proposed method has been implemented on two pairs of source-target datasets, and two different neural network architectures. While the results are comparable to fine-tuning, the method maintains generalization across the domains, and shows promising results for future work.

The main contributions and unique aspects of this work are: (1) proposing and testing a set of new metrics for minimizing feature covariances across domains; (2) proposing a new method in the supervised low-resource domain adaptation setting which, being non-adversarial, is significantly easier to train; and (3) proposing a model which generalizes across the source and target domains. It must also be noted that the proposed method can handle the standard case of domain adaptation over high resource unlabelled data with minor tweaks; the implementation of the unsupervised variant is left to future work.

2 Related Work

Domain adaptation primarily focuses on reducing domain shift, in three major ways. The first approach applies a form of regularization to better fit the model to the target domain (Aytar and Zisserman (2011), Bergamo and Torresani (2010), Becker et al. (2013)).

The second is to transform both domains into a domain invariant space, and make further inference for the specific task based on the features in this space. A popular approach is to use Generative Adversarial Networks (Goodfellow et al. (2014)) with the cycle consistency constraint from the CycleGAN (Zhu et al. (2017)). This constrains a particular example converted from the source to the target and back to the source to map onto itself (Hoffman et al. (2017), Liu et al. (2017)). Manders et al. (2018) align predicted class probabilities across domains to achieve state-of-the-art results, in addition to being robust to overfitting. This class of methods consistently shows state-of-the-art results on standard benchmarks. However, all these methods train models adversarially with a minimax objective, which makes reaching an optimum hard (Arjovsky and Bottou (2017)). In fact, recent work shows that the objective functions of GANs have no optimum and must be treated as equilibration problems, indicating that the use of traditional optimization algorithms on GANs is 'broken' (Gemp and Mahadevan (2018), Mescheder et al. (2017)).

The third approach is to find some form of mapping from the source domain to the target domain. The proposed method roughly falls into this category, with the slight difference that a mapping from the target to the source domain is learnt. Sun and Saenko (2016) and Sun et al. (2017) are closely related to the proposed method. They align the second order statistics of the source and target distributions with a non-linear transformation. The loss used is the CORAL loss, the squared Frobenius norm of the difference between the covariances of the source and target features.
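For concreteness, the CORAL loss can be written in a few lines of PyTorch (the framework used later in this work); this is a sketch of the published formulation, not the authors' code:

```python
import torch

def covariance(feats):
    # feats: (n, d) batch of features; returns the (d, d) covariance estimate.
    n = feats.size(0)
    centered = feats - feats.mean(dim=0, keepdim=True)
    return centered.t() @ centered / (n - 1)

def coral_loss(source_feats, target_feats):
    # Squared Frobenius norm of the difference between the source and
    # target covariance matrices, scaled by 1/(4 d^2) as in Sun and Saenko (2016).
    d = source_feats.size(1)
    diff = covariance(source_feats) - covariance(target_feats)
    return (diff ** 2).sum() / (4 * d * d)
```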

Unlike LRS-DAG, their method works on unsupervised domain adaptation. Additionally, the model does not retain useful information about the source while learning, and does not generalize across domains. It also uses a strong prior for both domains, by plugging in AlexNet as a stem for the network.

Haeusser et al. (2017) follow a very similar setting, also using an unlabelled target domain. They learn statistically domain invariant embeddings while minimizing the classification error on the labelled source domain. This model holds the same weaknesses as Deep CORAL.

3 Methodology

Figure 1: The architecture for the proposed methodology, with an arbitrary neural network trained on a classification task. The notation has been simplified such that $S$ and $T$ denote single datapoints from $X_S$ and $X_T$.

3.1 LRS-DAG

The LRS-DAG method works for an arbitrary model trained on the task of classification on the source data $D_S = \{(x_S, y_S)\}$, where $x_S \in X_S$ and $y_S \in Y_S$. The layers of the network are divided into two groups, $N_1$ and $N_2$. The model is trained in a standard manner, with the objective being to minimize an arbitrary classification loss. At the end of training, $N_1$ learns a function $f_1 : X_S \to Z_S$, which maps $X_S$ to the intermediate features $Z_S$. Similarly, $N_2$ learns the function $f_2 : Z_S \to Y_S$, which maps $Z_S$ to $Y_S$.

The key idea to generalizing across both domains is to keep the mappings created by $N_1$ and $N_2$ unaltered, and instead leverage them in their unaltered condition to optimize performance over the target data $D_T = \{(x_T, y_T)\}$, where $x_T \in X_T$ and $y_T \in Y_T$. Thus, the weights of $N_1$ and $N_2$ are kept frozen in the next phase of training over the target domain data.

A new set of layers, the 'Encoder layers', represented by $E$, are introduced between $N_1$ and $N_2$ in this phase (as shown in Figure 1). With the target domain, $E$ gets as input $Z_T = f_1(X_T)$, and must somehow map that to $Z_S$. For this, $E$ is trained to learn a function $f_E$ such that $f_E(Z_T) \approx Z_S$. This allows the input to $N_2$ to lie in $Z_S$ regardless of the current domain, so $N_2$ can function in the same manner as before. Evidently, the objective function to be minimized for training $E$ is some measure of the difference between $f_E(Z_T)$ and $Z_S$. Six loss functions have been proposed:

  • Method 1: $\mathcal{L}_1 := \mathcal{L}_C$

  • Method 2: $\mathcal{L}_2 := \mathcal{L}_C + \lVert \mu_S - \mu_T \rVert_2^2$

  • Method 3: $\mathcal{L}_3 := \mathcal{L}_C + D_{KL}(Z_S \,\Vert\, f_E(Z_T))$

  • Method 4: $\mathcal{L}_4 := \mathcal{L}_C + \lVert C_S - C_T \rVert_F^2$

  • Method 5: $\mathcal{L}_5 := \mathcal{L}_C + D_{KL}(f_E(Z_T) \,\Vert\, Z_S)$

  • Method 6: $\mathcal{L}_6 := \mathcal{L}_C + \mathcal{L}_{CORAL}$

where $\mathcal{L}_C$ can be defined as any classification loss on $X_T$, which in this case has been defined as the cross entropy loss between the predictions $\hat{y}_T$ and the labels $y_T$; $\mathcal{L}_{CORAL}$ is the CORAL loss from Section 2; and $\mu_S$, $\mu_T$, $C_S$ and $C_T$ are the means and covariances of the source and encoded target feature sets respectively. Methods 3 and 5 have both been included, since KL divergence is not symmetric. Method 6 has been included as a comparison method: since Deep CORAL is implemented in a different architecture and data setting than LRS-DAG, only the CORAL loss can be used as a comparison metric.
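A minimal sketch of how these loss variants might be computed on mini-batches of features, assuming the formulations listed above (all names are illustrative; the features are softmax-normalized before the KL term, as described in Section 5):

```python
import torch
import torch.nn.functional as F

def moment_losses(z_s, z_t):
    # First- and second-order statistic mismatches (Methods 2 and 4).
    mean_term = ((z_s.mean(dim=0) - z_t.mean(dim=0)) ** 2).sum()
    def cov(z):
        zc = z - z.mean(dim=0, keepdim=True)
        return zc.t() @ zc / (z.size(0) - 1)
    cov_term = ((cov(z_s) - cov(z_t)) ** 2).sum()
    return mean_term, cov_term

def kl_term(z_s, z_t):
    # KL(Z_S || f_E(Z_T)) on softmax-normalized features (Method 3);
    # swapping the arguments gives the Method 5 direction.
    p = F.softmax(z_s, dim=1)
    log_q = F.log_softmax(z_t, dim=1)
    return F.kl_div(log_q, p, reduction="batchmean")

def total_loss(logits, y_t, z_s, z_t, method):
    # L_C: cross entropy on the target labels, plus a matching term.
    ce = F.cross_entropy(logits, y_t)
    if method == 3:
        return ce + kl_term(z_s, z_t)
    if method == 5:
        return ce + kl_term(z_t, z_s)
    mean_term, cov_term = moment_losses(z_s, z_t)
    return ce + (mean_term if method == 2 else cov_term)
```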

During inference, depending on the domain the model is currently being applied to, the encoder layers can be included in or excluded from the forward pass ((a) and (b) in Figure 1).
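A minimal PyTorch sketch of this two-phase architecture (module names are assumptions): $N_1$ and $N_2$ are frozen after source training, and the encoder is included in the forward pass only for target-domain inputs:

```python
import torch.nn as nn

class LRSDAGNet(nn.Module):
    def __init__(self, n1: nn.Module, encoder: nn.Module, n2: nn.Module):
        super().__init__()
        self.n1, self.encoder, self.n2 = n1, encoder, n2

    def freeze_source_layers(self):
        # Phase 2: only the encoder layers E remain trainable.
        for p in list(self.n1.parameters()) + list(self.n2.parameters()):
            p.requires_grad = False

    def forward(self, x, use_encoder: bool):
        z = self.n1(x)                 # f_1: input -> intermediate features
        if use_encoder:                # target domain, path (a) in Figure 1
            z = self.encoder(z)        # f_E: Z_T -> (approximately) Z_S
        return self.n2(z)              # f_2: features -> class scores
```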

The proposed domain adaptation method aims to be model agnostic; hence, it should work for any arbitrary network trained on a classification task. For this reason, it has been tested with a basic fully connected network (for initial experimentation and proof of concept), and then with a standard CNN used for classification on the selected datasets.

3.2 Other Aspects of the Training Regime

A practical issue with the above loss metrics is that the sizes of the source and target sets differ, while the metrics assume a one-to-one correspondence between source and target features. Two solutions were considered. The first is to parameterize the entire source feature distribution by estimating $\mu_S$ and $C_S$, and then to sample points from a multivariate Gaussian $\mathcal{N}(\mu_S, C_S)$. However, since these are just estimates of the mean and covariance of the true source distribution, they may be biased. This might lead to $E$ learning a spurious function.

The other option is to simply sample points from the observed source features $Z_S$. However, an issue with this is that a certain degree of information about the observed distribution is lost. An extreme case would be one where all the sampled points come from the tails of the observed source distribution, causing $E$ to map toward a distribution different from $Z_S$.

Both methods have been tested, and results are presented in Section 6. For simplicity, the first method shall be referred to as 'indirect sampling' and the second as 'random sampling'.
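The two sampling strategies might be implemented as follows (a sketch; `mu_s` and `cov_s` are the estimated source statistics, `z_s` the observed source features):

```python
import torch
from torch.distributions import MultivariateNormal

def indirect_sampling(mu_s, cov_s, n):
    # Parameterize the source features as N(mu_S, C_S) and draw n points.
    # A small diagonal jitter keeps the covariance positive-definite.
    jitter = 1e-5 * torch.eye(cov_s.size(0))
    return MultivariateNormal(mu_s, covariance_matrix=cov_s + jitter).sample((n,))

def random_sampling(z_s, n):
    # Subsample n of the observed source feature points uniformly at random.
    idx = torch.randperm(z_s.size(0))[:n]
    return z_s[idx]
```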

4 Datasets

MNIST

The MNIST dataset contains 28x28 grayscale images of handwritten digits labelled from 0 to 9, with predefined training and testing splits of 60,000 and 10,000 examples respectively. The images were scaled to 32x32 and normalized. This has been used as the source domain.

SVHN

The Street View House Numbers dataset is a real-world image dataset obtained from Google Street View imagery. Like MNIST, it contains images of cropped digits between 0 and 9, but the images come from a significantly harder problem. The dataset consists of approximately 73,000 training images (of which 10% was retained for the limited labelled data scenario) and 26,000 test images. The images were converted to grayscale and normalized. MNIST-SVHN is a standard benchmark for domain adaptation tasks, which is why these datasets have been used for initial testing.
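The preprocessing described above might look as follows with torchvision (a sketch; paths, normalization constants, and the split mechanics are illustrative):

```python
import torch
from torchvision import datasets, transforms

mnist_tf = transforms.Compose([
    transforms.Resize(32),                        # scale 28x28 digits to 32x32
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
svhn_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # SVHN is RGB; convert to grayscale
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

source = datasets.MNIST("data/", train=True, download=True, transform=mnist_tf)
svhn = datasets.SVHN("data/", split="train", download=True, transform=svhn_tf)

# Retain 10% of the SVHN training labels for the low resource setting.
n_kept = len(svhn) // 10
target, _ = torch.utils.data.random_split(svhn, [n_kept, len(svhn) - n_kept])
```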

Synthetic-MNIST

To see how LRS-DAG performs under different levels of domain shift, this dataset was created by applying a series of transformations to MNIST. Random horizontal flips were applied to samples from the data, and images were sheared. In addition, the brightness, contrast and saturation of the images were randomly changed. As with SVHN, only 10% of the labelled training data was used.
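The Syn-MNIST construction can be approximated with standard torchvision transforms; the shear angle and jitter magnitudes below are placeholders, since the exact values are not stated:

```python
from torchvision import datasets, transforms

syn_mnist_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),   # 3 channels so ColorJitter applies
    transforms.RandomHorizontalFlip(p=0.5),        # random horizontal flips
    transforms.RandomAffine(degrees=0, shear=10),  # shear the digits
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.Grayscale(num_output_channels=1),   # back to a single channel
    transforms.Resize(32),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
syn_mnist = datasets.MNIST("data/", train=True, download=True,
                           transform=syn_mnist_tf)
```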

Validation sets were made from the training splits for these datasets, and rolled back into the training sets after performing a grid search over the hyperparameter space, and judging performance over the validation set.

Figure 2: Left: Samples from the MNIST, SVHN and Syn-MNIST datasets. Right: Distributions of classes over the test data, for all datasets.

5 Experiments

The goal of this series of experiments was to test the LRS-DAG method, with all its variants of loss functions. To show that the proposed training regime would work as an efficient form of Domain Adaptation, it has been tested over different models, and different sets of datasets. To test for the correctness of the hypothesis that the method is model agnostic, all experiments have been run for two networks:

  • Model 1: A fully connected network with 4 hidden layers. The output layer generates softmax predictions for all classes. $N_1$, $E$ and $N_2$ consist of the bottom two, middle two and top two layers of the network, respectively. $E$ is ignored when operating in the source domain. The network has no non-linearities.

  • Model 2: A standard CNN used for learning from domains of similar complexity. The network consists of 5 convolutional layers and one fully connected layer, each followed by a ReLU non-linearity. Softmax is applied over the output of the last layer to give a confidence score for every class. As with Model 1, $N_1$, $E$ and $N_2$ consist of the bottom two, middle two and top two layers of the network, respectively. In both models, $E$ takes $Z_T$ and returns values of the same shape. Illustrative definitions of both models are sketched after this list.
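The following sketches both architectures under the stated layer counts; widths and channel counts are assumptions, since the paper does not specify them:

```python
import torch.nn as nn

# Model 1: fully connected, no non-linearities; N1/E/N2 are two layers each.
n1_fc = nn.Sequential(nn.Linear(32 * 32, 256), nn.Linear(256, 256))
e_fc = nn.Sequential(nn.Linear(256, 256), nn.Linear(256, 256))    # same shape in/out
n2_fc = nn.Sequential(nn.Linear(256, 256), nn.Linear(256, 10))    # softmax at inference

# Model 2: five conv layers and one fully connected layer, each with ReLU.
def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

n1_cnn = nn.Sequential(conv_block(1, 32), conv_block(32, 32))
e_cnn = nn.Sequential(conv_block(32, 32), conv_block(32, 32))     # shape-preserving
n2_cnn = nn.Sequential(conv_block(32, 32), nn.Flatten(),
                       nn.Linear(32 * 32 * 32, 10))
```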

The model is first trained on the source domain for 100 epochs. Following this, there are three main sets of experiments: one transferring from MNIST to SVHN with Model 1, one from MNIST to SVHN with Model 2, and one from MNIST to Syn-MNIST with Model 1. For each of these sets of experiments, all the loss functions (with the indirect and random sampling methods described in Section 3.2) are tested. Additionally, they are compared with a series of baselines.

Baseline Methods:

(1) Source Trained: The most rudimentary baseline considered was a model trained on the source dataset, used directly for inference on the target. This serves as the lower baseline. (2) Target Trained: Train the model from scratch on the target domain. The high capacity model is likely to overfit to the limited data, thus performing poorly on the target test set. (3) Finetune N2: Finetune the weights of $N_2$ on the target domain, after training the model on the source domain. This is akin to the most standard approach when limited labelled data is available. (4) CORAL Loss: Despite the Deep CORAL method (described in Section 2) being targeted at the setting of a large pool of unlabelled data, it is still the method most similar to LRS-DAG. For this reason, the CORAL loss has been fit into the LRS-DAG architecture as a loss function. This is expected to be the strongest comparison method, as the loss aligns the second order statistics of the source and target distributions.

Additional points of note are that:

  • The method was implemented from scratch, using PyTorch 0.4, SciPy, NumPy and scikit-learn. No other existing implementations or frameworks were used.

  • The accuracy of the model on the hidden test set of the target domain was used as a metric of the performance of a model. Confusion Matrices were also used to further analyse the methods, but have been excluded from this work for brevity.

  • Hyperparameter tuning was done through a grid search over learning rate, weight decay, and the kind of optimizer. This performance was measured over the validation set, which was later rolled back into the training set for all methods. The validation set splits were stored and, for a particular dataset, the same data points were used as the validation set for all methods.

  • The Adam optimizer was used for all models. On average, all methods for a particular model-source-target triplet required very similar hyperparameters.

  • To account for the stochasticity of weight initializations, every experiment has been run for three trials, and their averaged results have been showcased in Table 1.

  • All models were trained until satisfying the stopping criterion of the difference in loss between two epochs being less than a particular threshold (thresholds varied, based on the type of loss function).

  • Since intermediate features in a network are not probability distributions, and the method relies on the assumption that they are, a softmax function is applied over the features after extracting them from the network, converting them into valid probability distributions. The resulting training loop is sketched below.
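Combining the points above, the encoder-training loop with the loss-plateau stopping criterion might be structured as follows (a sketch; `sample_source` is one of the sampling functions from Section 3.2 and the threshold is illustrative):

```python
import torch.nn.functional as F

def train_encoder(model, target_loader, sample_source, optimizer, tol=1e-4):
    model.freeze_source_layers()            # only E is updated in this phase
    prev_loss = float("inf")
    while True:
        epoch_loss = 0.0
        for x_t, y_t in target_loader:
            z_t = model.encoder(model.n1(x_t))      # encoded target features
            z_s = sample_source(z_t.size(0))        # matched source features
            # Softmax turns intermediate features into valid distributions
            # before the KL term (Method 3 shown here).
            kl = F.kl_div(F.log_softmax(z_t, dim=1),
                          F.softmax(z_s, dim=1), reduction="batchmean")
            loss = F.cross_entropy(model.n2(z_t), y_t) + kl
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:       # loss difference plateaued
            return
        prev_loss = epoch_loss
```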

6 Results

Experiment Set 1: Indirect sampling of the source domain consistently outperforms random sampling. Hence, the information lost when sampling 10% of the points from the source is relatively large. All proposed methods were expected to have similar outcomes, but Method 3 (KL divergence) slightly outperforms the others. Method 5 performs very similarly, since a very similar notion of distance is being minimized in both methods.

A point worth noting is that, while Target Trained outperformed the other methods, LRS-DAG with Method 3 performs almost the same as fine-tuning. However, unlike fine-tuning, the proposed method maintains generalization across both domains.

Another notable point is the weak performance of the CORAL loss in all three experiment sets. This may be because the Deep CORAL method simultaneously trains on the source and target domains, jointly minimizing estimators of the true covariances with roughly equal strength in both domains. With LRS-DAG, the estimate of $C_S$ from the source is already very accurate, which might cause the target covariance $C_T$ to converge to an alternate value.

Method           Sampling Strategy   Source Domain   Target Domain   Without: Src/Tgt Acc   With: Src/Tgt Acc
Target Trained   -                   MNIST           SVHN            25.44 / 13.92          6.39 / 34.52
Finetune N2      -                   MNIST           SVHN            15.42 / 14.98          11.08 / 30.35
1                -                   MNIST           SVHN            91.91 / 13.93          11.46 / 29.97
2                Indirect            MNIST           SVHN            91.91 / 13.93          13.53 / 29.64
2                Random              MNIST           SVHN            91.91 / 13.93          11.31 / 29.38
3                Indirect            MNIST           SVHN            91.91 / 13.93          12.18 / 30.28
3                Random              MNIST           SVHN            91.91 / 13.93          14.25 / 28.85
4                Indirect            MNIST           SVHN            91.91 / 13.93          10.55 / 29.66
4                Random              MNIST           SVHN            91.91 / 13.93          10.98 / 28.15
5                Indirect            MNIST           SVHN            91.91 / 13.93          11.75 / 29.79
5                Random              MNIST           SVHN            91.91 / 13.93          7.73 / 30.15
6                Indirect            MNIST           SVHN            91.91 / 13.93          11.35 / 19.59
6                Random              MNIST           SVHN            91.91 / 13.93          12.46 / 19.32
Table 1: Accuracies (in percentage, averaged over three trials) of baselines and proposed methods, for Model 1, when transferring from MNIST to SVHN.
Method        Sampling Strategy   Source Domain   Target Domain   Without: Src/Tgt Acc   With: Src/Tgt Acc
Finetune N2   -                   MNIST           SVHN            16.44 / 18.47          13.53 / 26.77
1             -                   MNIST           SVHN            93.88 / 20.19          24.34 / 21.88
2             Indirect            MNIST           SVHN            93.88 / 20.19          26.62 / 21.73
3             Indirect            MNIST           SVHN            93.88 / 20.19          28.52 / 21.85
4             Indirect            MNIST           SVHN            93.88 / 20.19          60.79 / 21.29
5             Indirect            MNIST           SVHN            93.88 / 20.19          29.24 / 21.92
6             Indirect            MNIST           SVHN            93.88 / 20.19          11.46 / 20.59
Table 2: Accuracies (in percentage, averaged over three trials) of baselines and proposed methods, for Model 2, when transferring from MNIST to SVHN.
Method        Sampling Strategy   Source Domain   Target Domain   Without: Src/Tgt Acc   With: Src/Tgt Acc
Finetune N2   -                   MNIST           Syn-MNIST       89.27 / 65.42          86.12 / 77.33
1             -                   MNIST           Syn-MNIST       91.91 / 63.13          84.52 / 77.09
2             Indirect            MNIST           Syn-MNIST       91.91 / 63.13          84.98 / 78.09
3             Indirect            MNIST           Syn-MNIST       91.91 / 63.13          85.19 / 78.14
4             Indirect            MNIST           Syn-MNIST       91.91 / 63.13          85.26 / 77.95
5             Indirect            MNIST           Syn-MNIST       91.91 / 63.13          84.88 / 78.11
6             Indirect            MNIST           Syn-MNIST       91.91 / 63.13          76.07 / 67.94
Table 3: Accuracies (in percentage, averaged over three trials) of baselines and proposed methods, for Model 1, when transferring from MNIST to Syn-MNIST.
Figure 3: Top row: t-SNE visualizations of raw data from MNIST, SVHN and Syn-MNIST. Second row: $Z_S$ for the source domain. Third row: $Z_T$ for the target domain. Bottom row: $f_E(Z_T)$ after feature alignment. Left column: MNIST to SVHN with Model 1. Middle column: MNIST to Syn-MNIST with Model 1. Right column: MNIST to SVHN with Model 2.
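Plots like those in Figure 3 can be reproduced along these lines with scikit-learn (a sketch; `feats` and `labels` are assumed NumPy arrays of extracted features and class labels):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats, labels, title):
    # Project high-dimensional features to 2-D for qualitative inspection.
    emb = TSNE(n_components=2).fit_transform(feats)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")
    plt.title(title)
    plt.show()
```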

Experiment Set 2: When using the CNN for adapting from MNIST to SVHN, it seems possible that the stopping criterion was not accurately applied. This would explain why the results on the target set in this experiment set are lower than in the previous one, despite the CNN being more powerful. Once again, the loss functions based on KL divergence outperformed the other proposed methods and the CORAL loss, and once again, Methods 3 and 5 have almost identical results. However, the results of all methods are extremely similar in this set, making them inconclusive, although fine-tuning clearly surpasses the other methods.

Experiment Set 3: The Syn-MNIST domain has a smaller shift from MNIST than SVHN does, and so the results are more promising. Here, Methods 3 and 5 outperform fine-tuning, which is the most positive result so far. This is a good sign: in low resource supervised domain adaptation, it is common to treat a highly similar domain as the source domain and finetune over the target, and LRS-DAG provides a clear benefit in this case.

7 Discussion and Conclusion

A point worth arguing is whether it would make more sense to add $E$ to the top of the network, rather than having it deal with intermediate features. A series of experiments (involving training different parts of the network and analyzing the results) showed that the features across domains differ at the lower levels, indicating that positioning $E$ lower in the network is more beneficial (results excluded for brevity).

The LRS-DAG method is comparable to fine-tuning, except in the case of the Syn-MNIST dataset, a closely related domain, where it does better. Moreover, the proposed method significantly outperforms CORAL and, most importantly, maintains generalization across both domains.

The method was inconclusive with CNNs. The performance may have suffered due to a bad stopping criterion, or due to difficulty in aligning domains across convolutions. However, in the t-SNE plots of the aligned target domain after training, the points of all classes have been clustered together, so the proposed LRS-DAG method shows promise. Taking the observations so far into account, and designing a better experimental setup, may provide more promising results in the future.

A possible path to pursue in the future would be to align the domains separately for each class, instead of using an overall domain loss. Another direction to explore would be extending this model to the unsupervised domain adaptation setting.

References

  • M. Arjovsky and L. Bottou (2017) Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862. Cited by: §1, §2.
  • Y. Aytar and A. Zisserman (2011) Tabula rasa: model transfer for object category detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2252–2259. Cited by: §2.
  • C. J. Becker, C. M. Christoudias, and P. Fua (2013) Non-linear domain adaptation with boosting. In Advances in Neural Information Processing Systems, pp. 485–493. Cited by: §2.
  • S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Machine learning 79 (1-2), pp. 151–175. Cited by: §1.
  • A. Bergamo and L. Torresani (2010) Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in neural information processing systems, pp. 181–189. Cited by: §2.
  • I. Gemp and S. Mahadevan (2018) Global convergence to the equilibrium of gans using variational inequalities. arXiv preprint arXiv:1808.01531. Cited by: §2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2.
  • P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers (2017) Associative domain adaptation. In International Conference on Computer Vision (ICCV), Vol. 2, pp. 6. Cited by: §1, §2.
  • J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2017) Cycada: cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213. Cited by: §1, §2.
  • Y. Hoshen and L. Wolf (2018) NAM: non-adversarial unsupervised domain mapping. arXiv preprint arXiv:1806.00804. Cited by: §1.
  • E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher (2018) Augmented cyclic adversarial learning for domain adaptation. arXiv preprint arXiv:1807.00374. Cited by: §1.
  • J. Huang, A. Gretton, K. M. Borgwardt, B. Schölkopf, and A. J. Smola (2007) Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pp. 601–608. Cited by: §1.
  • J. Jiang (2008) A literature survey on domain adaptation of statistical classifiers. URL: http://sifaka.cs.uiuc.edu/jiang4/domainadaptation/survey, 3, pp. 1–12. Cited by: §1.
  • M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pp. 700–708. Cited by: §1, §2.
  • M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791. Cited by: §1.
  • J. Manders, E. Marchiori, and T. van Laarhoven (2018) Simple domain adaptation with class prediction uncertainty alignment. arXiv preprint arXiv:1804.04448. Cited by: §2.
  • L. Mescheder, S. Nowozin, and A. Geiger (2017) The numerics of gans. In Advances in Neural Information Processing Systems, pp. 1825–1835. Cited by: §2.
  • S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto (2017a) Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 6670–6680. Cited by: §1.
  • S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto (2017b) Unified deep supervised domain adaptation and generalization. In The IEEE International Conference on Computer Vision (ICCV), Vol. 2, pp. 3. Cited by: §1.
  • V. M. Patel, R. Gopalan, R. Li, and R. Chellappa (2015) Visual domain adaptation: a survey of recent advances. IEEE signal processing magazine 32 (3), pp. 53–69. Cited by: §1.
  • B. Sun, J. Feng, and K. Saenko (2016) Return of frustratingly easy domain adaptation.. In AAAI, Vol. 6, pp. 8. Cited by: §1.
  • B. Sun, J. Feng, and K. Saenko (2017) Correlation alignment for unsupervised domain adaptation. In Domain Adaptation in Computer Vision Applications, pp. 153–171. Cited by: §2.
  • B. Sun and K. Saenko (2016) Deep coral: correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pp. 443–450. Cited by: §1, §2.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint. Cited by: §1, §2.