Unsupervised Domain Adaptation in the Absence of Source Data

Roshni Sahoo et al. ∙ July 20, 2020

Current unsupervised domain adaptation methods can address many types of distribution shift, but they assume data from the source domain is freely available. As the use of pre-trained models becomes more prevalent, it is reasonable to assume that source data is unavailable. We propose an unsupervised method for adapting a source classifier to a target domain that varies from the source domain along natural axes, such as brightness and contrast. Our method only requires access to unlabeled target instances and the source classifier. We validate our method in scenarios where the distribution shift involves brightness, contrast, and rotation and show that it outperforms fine-tuning baselines in scenarios with limited labeled data.


1 Introduction

Machine learning methods operate under the assumption that training data and test data come from the same distribution. When this assumption is violated, the performance of a model trained on the source domain (training distribution) degrades when tested on the target domain (test distribution) (Hendrycks and Dietterich, 2019). This problem is widespread; sensitivity to test-time perturbations has been shown in facial recognition software (Karahan et al., 2016), medical image analysis (Song et al., 2017), and self-driving car vision modules (Yang et al., 2018).

Domain adaptation techniques rely on labeled source examples and no (or few) labeled target examples to address this problem (Redko et al., 2020). In unsupervised domain adaptation (UDA), no labeled target examples are available. Previous works in UDA fall into two main classes. The first class aims to align representations of the source and target domains in some feature space (Ganin and Lempitsky, 2015). The second class of methods uses generative models to transform source images to resemble target images (Bousmalis et al., 2017).

These methods can address general forms of distribution shift but remain limited by the assumption of freely available source data. The source data may be inaccessible, for example, due to contractual obligations between data owners and data customers (Chidlovskii et al., 2016). In addition, as the usage of pre-trained models rises in popularity, it is common to have access to a model but not the data on which it was trained.

Figure 1: Training pipeline. A target image is passed to the transformation network, which predicts transformation parameters $\phi$. The transform $T(\cdot\,; \phi)$ is applied to the image. Next, the loss is computed, rewarding parameters that result in a high maximum softmax probability under the source classifier $f$.

With stricter assumptions on the nature of the distribution shift, we propose a method for unsupervised domain adaptation in the absence of source data. We consider settings in which the target domain is shifted from the source along natural axes of variation. A realistic use case is adapting classifiers trained on medical images. Differences in protocols can cause variation in resolution, intensity profile, and contrast for MRI volumes (Kushibar et al., 2019) and chest X-rays (Lenga et al., 2020). Furthermore, chest X-rays may suffer from geometric deformations resulting from poor scan conditions, and aligning the images can improve performance on a downstream classification task (Liu et al., 2019).

Our method leverages the softmax probabilities of the source classifier to learn transformations that bring target images closer to the source domain (Figure 1). In our evaluations, we demonstrate that learning transformations can recover accuracy lost by the source classifier on various target domains. Furthermore, we find that our unsupervised method outperforms fine-tuning in label-scarce settings.

2 Problem Definition

We aim to produce transforms that map images in the target domain to the source domain, with the goal of improving accuracy on a $K$-way classification task.

Definitions. A domain consists of an image space $\mathcal{X}$ and a label space $\mathcal{Y}$. The source domain is $\mathcal{D}_S = (\mathcal{X}_S, \mathcal{Y}_S)$, and the target domain is $\mathcal{D}_T = (\mathcal{X}_T, \mathcal{Y}_T)$. Both domains share the same label space, $\mathcal{Y}_S = \mathcal{Y}_T = \{1, \dots, K\}$. A transform is a function $T: \mathcal{X} \times \Phi \to \mathcal{X}$, where $\Phi$ is the space of transformation parameters.

Distribution Shift. We model the distribution shift from source to target domain as a non-deterministic application of label-preserving forward transforms $T(\cdot\,; \phi)$ with parameters $\phi \sim \mathcal{N}(\mu, \sigma^2)$. For a fixed $\mu$ and $\sigma$, we can generate

$\mathcal{X}_T = \{\, T(x; \phi_x) : x \in \mathcal{X}_S,\ \phi_x \sim \mathcal{N}(\mu, \sigma^2) \,\}.$

To generate each image in $\mathcal{X}_T$, a new forward transform parameter $\phi_x$ is sampled and the corresponding transform is applied to a source image $x \in \mathcal{X}_S$. We restrict the choice of $\mu$ and $\sigma$ such that the forward transforms preserve labels with high probability.

Assumptions. To adapt to this distribution shift, we require the following inputs:

  1. A source classifier $f$. The classifier's training set is sampled from the source domain $\mathcal{D}_S$. The classifier $f$ produces class probabilities through a softmax layer. In practice, $f$ can be a pre-trained neural network.

  2. A set of unlabeled target images $X_T = \{x_1, \dots, x_n\}$, where each $x_i \in \mathcal{X}_T$.

  3. A class of differentiable backward transforms $\mathcal{B}$. Transforms in $\mathcal{B}$ are applied to examples from $X_T$.

We assume no access to data from the source domain ($\mathcal{D}_S$) or labels from the target domain ($\mathcal{Y}_T$).

Learning a Transformation Network. Our goal is to approximately recover the underlying source images by learning the optimal backward transform for each target image in $X_T$. If the forward transforms are invertible and $\mathcal{B}$ contains their inverses, the source images can be recovered exactly; if not, they can only be approximated. Learning the transformation parameters is useful because applying the source classifier to the transformed test set achieves higher accuracy than running inference on $X_T$ directly. The difficulty of this task depends on the shift severity and shift range, which are represented by $\mu$ and $\sigma$, respectively.

3 Method

Our method consists of two steps: 1) learning transformation parameters that bring the target images closer to the source domain, and 2) transforming the target examples with the learned parameters and running inference on the resulting images using the source classifier $f$.

Previous work in out-of-distribution detection demonstrates that in-distribution examples tend to have greater maximum softmax probabilities (MSP) than out-of-distribution examples (Hendrycks and Gimpel, 2017). In addition, temperature scaling, a calibration procedure where the outputs of a classifier are scaled prior to applying the softmax layer, further enlarges the MSP gap between in-distribution and out-of-distribution examples (Liang et al., 2018).

Under distribution shift, we expect most target images $x \in X_T$ to be out-of-distribution for $f$. As a result, we develop a loss function that rewards predicted parameters that maximize the temperature-scaled MSP of the transformed image relative to that of the original image. Let $\tau$ be the temperature-scaling constant. Given an image $x$ and predicted parameters $\phi$, we aim to maximize the MSP gap between the transformed image and the original image,

$\max_k \mathrm{softmax}\!\left(f(T(x; \phi)) / \tau\right)_k \;-\; \max_k \mathrm{softmax}\!\left(f(x) / \tau\right)_k.$

We aim to predict the optimal $\phi$ for each image by training a transformation network $g_\theta: \mathcal{X}_T \to \Phi$, which maps target examples to transformation parameters. To train the network, we minimize the following loss function:

$\mathcal{L}(\theta) = \frac{1}{|X_T|} \sum_{x \in X_T} \left[\, -\max_k \mathrm{softmax}\!\left(f(T(x; g_\theta(x))) / \tau\right)_k \;+\; \max_k \mathrm{softmax}\!\left(f(x) / \tau\right)_k \,\right].$

The second term of the loss is constant with respect to $\theta$, so it does not affect the optimization of the transformation network; we include it so that the converged loss value is a meaningful quantity (a proxy for the distance between the transformed images and the original target images).
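This loss can be written directly on top of a frozen source classifier. Below is a minimal PyTorch sketch, assuming source_clf returns logits, transform is a differentiable backward transform, and transform_net predicts the parameters; all names and the temperature value are illustrative, not the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def msp(logits, tau):
        # Temperature-scaled maximum softmax probability, one value per example.
        return F.softmax(logits / tau, dim=1).max(dim=1).values

    def adaptation_loss(x, transform_net, transform, source_clf, tau=1000.0):
        # MSP-gap loss: MSP of the original images minus MSP of the transformed images.
        # The original-image term is constant w.r.t. the transformation network, but it
        # keeps the converged loss interpretable as a (negative) MSP gap.
        # tau=1000.0 is an illustrative large constant, not a value taken from the paper.
        phi = transform_net(x)                    # predicted transformation parameters
        x_hat = transform(x, phi)                 # differentiable backward transform
        with torch.no_grad():                     # no gradients needed for the originals
            msp_orig = msp(source_clf(x), tau)
        msp_new = msp(source_clf(x_hat), tau)     # source classifier is assumed frozen
        return (msp_orig - msp_new).mean()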

With the trained network, we predict $\phi$ for each image, apply the corresponding transforms, and run inference on the transformed images using $f$.
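At test time, the procedure is simply transform-then-classify; a minimal sketch under the same assumptions as the loss sketch above:

    import torch

    @torch.no_grad()
    def predict(x, transform_net, transform, source_clf):
        # Apply the learned per-image transform, then classify with the frozen source model.
        phi = transform_net(x)         # per-image transformation parameters
        x_hat = transform(x, phi)      # map target images toward the source domain
        return source_clf(x_hat).argmax(dim=1)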

Implementation. We clamp the outputs of the transformation network so that all predicted parameter values are constrained to the valid parameter range $\Phi$. We initialize the bias parameters of the network's last layer so that the predicted transform is initially close to the identity. The temperature-scaling constant is typically chosen using a validation set (Guo et al., 2017), but Liang et al. (2018) show that simply using a large constant is sufficient, and we follow this approach. Architecture and training details are provided in Section 6.4.1.
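One way to realize the clamping and the identity-style initialization described above is to bound the last layer's outputs and set its bias to parameters that leave the image unchanged. The sketch below is one possible implementation; the parameter ranges and identity values are hypothetical placeholders, not the paper's settings.

    import torch
    import torch.nn as nn

    class ParameterHead(nn.Module):
        # Final layer of the transformation network. Outputs are clamped to a valid
        # parameter range, and the bias is initialized so that the network initially
        # predicts parameters close to the identity transform.
        def __init__(self, in_features, lo, hi, identity_value):
            super().__init__()
            self.fc = nn.Linear(in_features, len(lo))
            self.register_buffer("lo", torch.tensor(lo))
            self.register_buffer("hi", torch.tensor(hi))
            with torch.no_grad():
                self.fc.bias.copy_(torch.tensor(identity_value))

        def forward(self, features):
            out = self.fc(features)
            return torch.max(torch.min(out, self.hi), self.lo)   # element-wise clamp

    # Example: a single brightness factor bounded to [0.2, 3.0] with identity value 1.0
    # (hypothetical range).
    head = ParameterHead(in_features=84, lo=[0.2], hi=[3.0], identity_value=[1.0])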

4 Experiments

First, we investigate the trade-off between fine-tuning, an adaptation technique that requires labeled target examples, and our unsupervised method. We find that we outperform fine-tuning in label-scarce settings. Second, we evaluate our method’s sensitivity to the severity and range of the distribution shift and show that the proposed method can achieve accuracies on par with a classifier trained on the target domain. In this section, we show results on CIFAR-10, where the source classifier is a ResNet-18 model. Further CIFAR-10 experiments (on fine-tuning and coping with shifts along multiple axes of variation) and MNIST experiments corroborate these results and can be found in the supplement, along with details on the experiment setup.

4.1 Setup Overview

Distribution Shift. We use the following forward transforms:

  • $T_B(x; \phi)$: scales the brightness of an image $x$ by a factor of $\phi$, where $\phi > 0$.

  • $T_R(x; \phi)$: rotates an image $x$ by $\phi$ degrees.

  • $T_C(x; \phi)$: scales the contrast of an image $x$ by a factor of $\phi$, where $\phi > 0$.

To simulate distribution shift, we use one or more forward transforms from above. For each selected transform, we pick a corresponding $\mu$ and $\sigma$, which govern the distribution of forward transform parameters. We express the forward transforms for a brightness shift as

$\mathcal{T}_B(\mu, \sigma) = \{\, x \mapsto T_B(x; \phi) : \phi \sim \mathcal{N}(\mu, \sigma^2) \,\}.$

Overloading the notation, we apply $\mathcal{T}_B(\mu, \sigma)$ to a dataset $X$ as follows:

$\mathcal{T}_B(\mu, \sigma)(X) = \{\, T_B(x; \phi_x) : x \in X,\ \phi_x \sim \mathcal{N}(\mu, \sigma^2) \,\},$

where a new parameter $\phi_x$ is sampled for each image. We use the same notation for contrast ($\mathcal{T}_C$) and rotation ($\mathcal{T}_R$) shifts as well.
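To make the construction concrete, here is a minimal sketch of generating a brightness-shifted target set with torchvision; adjust_brightness is a standard torchvision call, while the clipping threshold and the example parameters are illustrative assumptions.

    import torch
    import torchvision.transforms.functional as TF

    def brightness_shift(images, mu, sigma):
        # Apply the brightness forward transform with a freshly sampled factor per image.
        # images: an iterable of (C, H, W) tensors with values in [0, 1].
        shifted = []
        for x in images:
            factor = max(torch.randn(1).item() * sigma + mu, 0.05)  # keep the factor positive
            shifted.append(TF.adjust_brightness(x, factor))
        return torch.stack(shifted)

    # e.g. a mildly darkened target domain (illustrative parameters):
    # target_images = brightness_shift(cifar_test_images, mu=0.6, sigma=0.1)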

Backward Transforms. For all experiments, we assume that the distribution shift occurs along the axes of rotation, brightness, and contrast. Accordingly, we set the class of backward transforms $\mathcal{B}$ to be compositions of a rotation, a contrast, and a brightness transform, parameterized by $(\phi_b, \phi_c, \phi_r) \in \Phi$.
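Because the backward transforms must be differentiable with respect to their parameters, brightness and contrast can be expressed as simple tensor operations and rotation as a spatial-transformer-style resampling. The sketch below is one possible composition; its ordering and the contrast parameterization are assumptions for illustration, not taken from the paper.

    import torch
    import torch.nn.functional as F

    def backward_transform(x, phi):
        # Differentiable rotation/contrast/brightness transform.
        # x:   batch of images, shape (N, C, H, W), values in [0, 1]
        # phi: parameters, shape (N, 3) -> (brightness factor, contrast factor, angle in radians)
        b, c, theta = phi[:, 0], phi[:, 1], phi[:, 2]

        # Rotation via an affine sampling grid (differentiable w.r.t. theta).
        cos, sin = torch.cos(theta), torch.sin(theta)
        zeros = torch.zeros_like(theta)
        mat = torch.stack([torch.stack([cos, -sin, zeros], dim=1),
                           torch.stack([sin, cos, zeros], dim=1)], dim=1)  # (N, 2, 3)
        grid = F.affine_grid(mat, list(x.size()), align_corners=False)
        x = F.grid_sample(x, grid, align_corners=False)

        # Contrast: interpolate between the per-image mean and the image
        # (an approximation of the usual contrast adjustment).
        mean = x.mean(dim=(1, 2, 3), keepdim=True)
        x = mean + c.view(-1, 1, 1, 1) * (x - mean)

        # Brightness: scale pixel intensities.
        x = b.view(-1, 1, 1, 1) * x
        return x.clamp(0.0, 1.0)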

Datasets. We consider the CIFAR-10 dataset and use the pre-processing pipeline from the PyTorch model zoo (Paszke et al., 2019). We construct target domains by varying the contrast and brightness of the CIFAR-10 test set. We select these target domains because contrast and brightness changes are common corruptions of natural images. We do not evaluate adaptation to rotation shift on natural images because there is an artificial correlation between the optimal transformation and the size of the black artifacts at the corners of the rotated image.

Baselines. For each target domain, we assess the performance of two baselines. The first baseline is the source classifier trained on the CIFAR-10 training set. The second baseline is an oracle model trained on the target domain. We expect the oracle model to outperform our method because it is tested on the domain on which it was trained. To generate the training dataset for the oracle model, we apply forward transforms to the CIFAR-10 training set. The source classifier and oracle model have the same architecture.

4.2 Comparison to Fine-Tuning

Figure 2: Accuracy achieved by fine-tuning a source classifier with annotated target examples. Our unsupervised method outperforms fine-tuning the last layer (FT Last Layer) and fine-tuning the entire source classifier (FT Network) when only a small number of labeled target examples is available.

Fine-tuning is a common technique for adapting a source classifier to a target domain in the presence of labeled target examples (Chu et al., 2016). We compare our method, which does not use any labeled target examples, to fine-tuning the source classifier on labeled target examples.

Baselines. In addition to the baselines in Section 4.1, we compare our method to two fine-tuning schemes. In both schemes, a ResNet-18 model is initialized with the source classifier’s weights and is trained on labeled target examples. In the first scheme, we fine-tune the last layer, freezing all other model weights. In the second scheme, we fine-tune the entire network, permitting all model weights to be updated.

Evaluation. We consider mild and severe shifts along the axes of brightness and contrast. For mild shifts, the target domains are generated by applying brightness and contrast forward transforms with parameter means close to the identity setting to the CIFAR-10 test set; for severe shifts, we use parameter means further from the identity. These target domains are low-brightness and low-contrast settings; experiments on high-brightness and high-contrast target domains are in Section 6.2.1.

We evaluate each method on 30% of the examples from the target domain and produce error bars through repeated subsampling. Our unsupervised method is trained on images from the remaining 70% of the target domain. The fine-tuning baselines are trained on labeled examples from the same 70% of the target domain; of these labeled examples, one-fifth are used for validation. Note that in a real-world deployment of our unsupervised method, the entire unlabeled test set can be used for both training and inference.
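As a concrete illustration of this evaluation protocol, repeated subsampling of the held-out split can be implemented as below; the subsample fraction, repeat count, and prediction interface are illustrative assumptions.

    import torch

    def subsample_accuracy(eval_images, eval_labels, predict_fn, n_repeats=10, frac=0.5):
        # Estimate accuracy and an error bar by repeatedly subsampling the evaluation split.
        # eval_images and eval_labels are tensors; predict_fn maps a batch of images to labels.
        n = len(eval_images)
        k = int(frac * n)
        accs = []
        for _ in range(n_repeats):
            idx = torch.randperm(n)[:k]
            preds = predict_fn(eval_images[idx])
            accs.append((preds == eval_labels[idx]).float().mean().item())
        accs = torch.tensor(accs)
        return accs.mean().item(), accs.std().item()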

Results. Across these shifts, our method, which uses no labeled target examples, outperforms fine-tuning the last layer of the source classifier and fine-tuning the entire network when the number of labeled target examples is small (Figure 2). As the number of labeled target examples increases, the fine-tuning methods improve accuracy on the target domain. Both the fine-tuning approaches and our method improve performance relative to the source classifier.

4.3 Effect of Shift Severity and Range

Figure 3: Performance of our method as the distribution shift varies in severity (leftmost plots) and in range (rightmost plots).

Baselines. We compare our method to the two baselines described in Section 4.1.

Evaluation. We capture the shift severity and range with $\mu$ and $\sigma$, the mean and standard deviation of the forward transform parameters applied to generate the target domain. Let $X$ be the original CIFAR-10 test set. For the shift severity experiments, we assess performance on target domains $\mathcal{T}_B(\mu, \sigma)(X)$ generated by sweeping the mean $\mu$ while holding $\sigma$ fixed. For the shift range experiments, we assess performance on target domains generated by sweeping the standard deviation $\sigma$ while holding $\mu$ fixed. In the shift range experiments, we set $\mu$ slightly away from the identity setting to simulate a mild distribution shift (with the identity mean, most generated examples remain in-distribution). The target domains generated by contrast shifts are defined analogously with $\mathcal{T}_C$. We evaluate the methods on random subsamples of 30% of the target domain; images from the remaining 70% are used to train our unsupervised method.

Shift Severity Results. While the performance of the source classifier declines as $\mu$ moves further from the identity setting (Figure 3, leftmost plots), our method is often able to recover the lost accuracy. Our method achieves accuracy similar to the oracle model for all contrast shifts and for brightness shifts that darken the images. Although it performs better than the source classifier for brightness shifts that brighten the images, our method does not match the accuracy of the oracle model in that regime. Our method is limited by how well it can reverse the effect of the forward transform; in this case, we cannot easily add color back to an overexposed image.

Shift Range Results. As shift range increases, we observe that all methods decline in accuracy (Figure 3, rightmost plots). As the brightness shift range increases, the accuracy of our method declines more gradually than that of the source classifier. As the contrast shift range increases, both decrease in accuracy at a similar rate.

5 Conclusion

In contrast to previous UDA methods, which rely on source data, we demonstrate that unlabeled data from a target domain and a source classifier can be leveraged to adapt to distribution shift along natural axes. This work may have applications in medical imaging, where target domain annotations are costly and data from the source domain is confidential and unavailable. Future work includes extending our method to more of the corruptions suggested by Hendrycks and Dietterich (2019) and using it to cope with bias field corruption in MRI images (Song et al., 2017).

References

  • K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 95–104.
  • B. Chidlovskii, S. Clinchant, and G. Csurka (2016) Domain adaptation in the absence of source domain data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), New York, NY, USA, pp. 451–460.
  • B. Chu, V. Madhavan, O. Beijbom, J. Hoffman, and T. Darrell (2016) Best practices for fine-tuning visual classifiers to new domains. In ECCV Workshops.
  • G. Csurka, B. Chidlovskii, and S. Clinchant (2015) Adapted domain specific class means. pp. 80–84.
  • Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, PMLR 37, Lille, France, pp. 1180–1189.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, PMLR 70, Sydney, Australia, pp. 1321–1330.
  • D. Hendrycks and T. G. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA.
  • D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France.
  • S. Karahan, M. Kilinc Yildirum, K. Kirtac, F. S. Rende, G. Butun, and H. K. Ekenel (2016) How image degradations affect deep CNN-based face recognition? In 2016 International Conference of the Biometrics Special Interest Group (BIOSIG), pp. 1–5.
  • N. Karani, K. Chaitanya, and E. Konukoglu (2020) Test-time adaptable neural networks for robust medical image segmentation. arXiv preprint arXiv:2004.04668.
  • K. Kushibar, S. Valverde, S. González-Villà, J. Bernal, M. Cabezas, A. Oliver, and X. Lladó (2019) Supervised domain adaptation for automatic sub-cortical brain structure segmentation with minimal user interaction. Scientific Reports 9.
  • M. Lenga, H. Schulz, and A. Saalbach (2020) Continual learning for domain adaptation in chest x-ray classification. In Medical Imaging with Deep Learning.
  • S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada.
  • J. Liu, G. Zhao, Y. Fei, M. Zhang, Y. Wang, and Y. Yu (2019) Align, attend and locate: chest x-ray diagnosis via contrast induced attention network with limited supervision.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
  • I. Redko, E. Morvant, A. Habrard, M. Sebban, and Y. Bennani (2020) A survey on domain adaptation theory. arXiv preprint arXiv:2004.11829.
  • M. Sensoy, L. Kaplan, and M. Kandemir (2018) Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems 31, pp. 3179–3189.
  • S. Song, Y. Zheng, and Y. He (2017) A review of methods for bias correction in medical images. Biomedical Engineering Review 1 (1).
  • Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt (2019) Test-time training for out-of-distribution generalization. arXiv preprint arXiv:1909.13231.
  • R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese (2018) Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems, pp. 5334–5344.
  • L. Yang, X. Liang, T. Wang, and E. P. Xing (2018) Real-to-virtual domain unification for end-to-end autonomous driving. In Computer Vision - ECCV 2018, Part IV, Lecture Notes in Computer Science 11208, pp. 553–570.

6 Supplementary Material

6.1 Related Work

We give an overview of domain adaptation in the absence of source data. Our method draws inspiration from previous works in the out-of-distribution detection literature, so we outline the parallels.

Domain Adaptation without Source Data. In this setting, practitioners lack access to any data from the source domain. The lack of source data in this task distinguishes it from the majority of existing domain adaptation methods. With no or limited access to source data, existing domain adaptation methods consider the source classifier’s decisions as augmented features of the target data (Chidlovskii et al., 2016). Our approach similarly leverages the source classifier’s predicted class probabilities to perform adaptation.

Our method is also unsupervised, which further assumes that we do not have access to labeled instances of the target domain. Work by Csurka et al. (2015) trains marginalized stacked denoising autoencoders (mSDA) on the unlabeled target data and aggregations of the source data. Although this approach has low computational cost, it is not directly applicable to visual data: since the method operates on feature vectors, the images must either be flattened into vectors or embedded, and the adaptation is then performed on those vectors.

Our work is most closely related to a method that updates model parameters using a self-supervised loss at test time (Sun et al., 2019). While that method does not require source data, it assumes that the self-supervised loss was included in the original model's training regime. Our approach does not modify the original training process and can therefore be applied to pre-trained models.

Out-of-Distribution Detection. Many works in out-of-distribution detection use the softmax probabilities of classification models to determine whether an input is out-of-distribution (Hendrycks and Gimpel, 2017; Liang et al., 2018). Although the prediction probability from a softmax distribution corresponds poorly to confidence (Sensoy et al., 2018), Hendrycks and Gimpel demonstrate that the maximum softmax probability (MSP) of out-of-distribution examples tends to be lower than that of in-distribution examples, so MSP statistics are often sufficient for detecting whether an example is abnormal.

Our method is similar to these out-of-distribution detection works because we use the MSP to quantify whether a transformation brings target instances closer to the source domain. This practice is common; out-of-distribution metrics are the foundation of several methods that address domain shift (Volpi et al., 2018; Karani et al., 2020). We solve a different problem: we are interested in correcting out-of-distribution inputs.

6.2 Additional CIFAR-10 Results

6.2.1 Comparison to Fine-Tuning

Figure 4: The target domains represent modest overexposures and contrast shifts. Fine-tuning (the entire network or only the last layer) on too few examples in the case of these subtle shifts degrades accuracy below that of the original classifier, suggesting that our method may be a useful alternative in label-scarce settings.

Continuing the experiments of Section 4.2, we compare the performance of fine-tuning to our method on high brightness and high contrast target domains.

Baselines. We use the same baselines as described in Section 4.2.

Evaluation. Brightness transforms with large factors may not be label-preserving because excessive overexposure removes information relevant for classification, so we consider modest overexposures. We evaluate on target domains generated by applying brightness and contrast forward transforms with modest parameter means to the CIFAR-10 test set. Otherwise, our evaluation method is the same as in Section 4.2.

Results. We observe that our method provides small improvements over the source classifier in these experiments (Figure 4). The improvements are not as pronounced as in Section 4.2; we hypothesize that this is because the target and source domains are more similar, so it is more difficult to detect whether examples are out-of-distribution. In addition, as mentioned in Section 4.3, the brightness transform is not invertible once pixels are saturated by overexposure, so our method can at best approximate the source domain and does not match the performance of the oracle model. At the same time, fine-tuning (the entire network or only the last layer) on too few examples in the case of these subtle shifts results in lower accuracy than the original source classifier (Figure 4, leftmost and middle-right plots). This suggests that our method may be a useful alternative for coping with slight perturbations when labeled examples are scarce.

6.2.2 Effect of Shifts Along Multiple Axes

Baselines. We use the same baselines as in Section 4.1, the source classifier and oracle models.

Figure 5: The target domain varies from the source along the axes of both brightness and contrast. In the first set of experiments, we construct the target domains by fixing the mean brightness and sweeping over different mean contrasts. In the second set of experiments, we fix the mean contrast and sweep over different mean brightnesses.

Evaluation. Let $X$ be the original CIFAR-10 test set. We construct the target domains by applying both brightness and contrast forward transforms, $\mathcal{T}_B(\mu_B, \sigma_B)$ and $\mathcal{T}_C(\mu_C, \sigma_C)$, to $X$. In one set of experiments, we vary the mean contrast $\mu_C$ while fixing the mean brightness $\mu_B$. In another set of experiments, we vary $\mu_B$ while fixing $\mu_C$.

Results. We observe that our method recovers accuracy lost by the source classifier on these target domains (Figure 5). Similar to Section 4.3, we see that when the brightness shift overexposes the images, our method offers an improvement over the source classifier but does not recover full accuracy.

6.3 MNIST Results

6.3.1 Comparison to Fine-Tuning

Baselines. We compare our unsupervised method to the source classifier trained on the MNIST training set and oracle models as described in Section 4.1. The training data for each oracle model is generated by applying the corresponding forward transforms to the MNIST training set. Additionally, we compare to the fine-tuning methods described in Section 4.2.

Evaluation. We consider mild and severe shifts along the axis of rotation. For the mild shift, the target domain is generated by applying rotation forward transforms with a small mean angle to the MNIST test set; for the severe shift, we use a larger mean angle. Otherwise, the evaluation method is identical to Section 4.2.

Figure 6: Accuracy achieved by fine-tuning a source classifier with labeled target examples. Fine-tuning the network (FT Network) requires a large number of labeled examples to match our method, and fine-tuning the last layer (FT Last Layer) does not achieve the same accuracy as our method even when all labeled examples are used.

Results. Across the mild and severe shifts, our unsupervised method excels compared to fine-tuning when few labels are available. We outperform fine-tuning the final layer even when all labeled target examples are available, and fine-tuning the network unless a large number of labeled target examples is available (Figure 6). Our method performs on par with the oracle models in these experiments. As in Section 4.2, the fine-tuning schemes and our method demonstrate accuracy improvements relative to the source classifier.

6.3.2 Effect of Shift Severity and Shift Range

Baselines. As described in Section 4.1, we compare our method to 1) the source classifier trained on the MNIST training set and 2) oracle models, each trained on a target domain.

Evaluation. Let $X$ be the original MNIST test set. For the shift severity experiments, we construct the target domains $\mathcal{T}_R(\mu, \sigma)(X)$ by sweeping the mean rotation angle $\mu$ while holding $\sigma$ fixed. We limit the rotation angle because for large angles the forward transform is not label-preserving. For the shift range experiments, we construct the target domains by sweeping $\sigma$ while holding $\mu$ fixed. We set $\mu$ to a small nonzero angle in the shift range experiments to simulate a mild distribution shift (with a mean of zero degrees, most of the generated examples are in-distribution for the source classifier).

Figure 7: Performance of our method compared to the oracle model and the source classifier as the mean and standard deviation of the rotation shift change.

Shift Severity Results. Our method recovers full accuracy, performing as well as the oracle model on these target domains (Figure 7). Our method is especially successful on this set of target domains because the rotation transform is invertible on MNIST.

Shift Range Results. The source classifier’s performance declines drastically as the shift range increases. In contrast, the performance of our method and that of the oracle models decline gradually (Figure 7).

6.4 Experiment Details

6.4.1 Transformation Network Training

The transformation network is a CNN. For CIFAR-10, we use the following architecture (a code sketch follows the list):

  1. Convolutional layer.

  2. Max-pooling layer.

  3. Convolutional layer.

  4. Linear layer.

  5. Linear layer.

  6. Linear layer with output size equal to the number of transformation parameters.
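A PyTorch sketch of this layer ordering is given below; since the per-layer sizes are not reproduced in the list above, the channel counts, kernel sizes, strides, and hidden widths here are hypothetical placeholders chosen only so the shapes line up for 32x32 CIFAR-10 inputs.

    import torch.nn as nn

    class TransformationNet(nn.Module):
        # Conv -> max-pool -> conv -> three linear layers (ordering as listed above).
        # All sizes below are hypothetical; only the layer ordering and the final
        # output size follow the description in the text.
        def __init__(self, n_transform_params):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=5, stride=1), nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),
                nn.Conv2d(16, 32, kernel_size=5, stride=1), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 10 * 10, 120), nn.ReLU(),   # 32x32 input -> 10x10 feature map
                nn.Linear(120, 84), nn.ReLU(),
                nn.Linear(84, n_transform_params),         # one output per transformation parameter
            )

        def forward(self, x):
            return self.head(self.features(x))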

For MNIST, we modify the architecture slightly for single-channel images:

  1. Convolutional layer.

  2. Convolutional layer.

  3. Linear layer.

  4. Linear layer.

  5. Linear layer with output size equal to the number of transformation parameters.

We optimize the network weights using the Adam optimizer with learning rate 5e-5 and train for 30 epochs.
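Putting the pieces together, a training loop matching the stated optimizer settings (Adam, learning rate 5e-5, 30 epochs) might look like the sketch below; adaptation_loss refers to the illustrative loss sketch in Section 3, and the temperature value is again an assumption.

    import torch

    def train_transformation_net(transform_net, source_clf, target_loader, transform,
                                 tau=1000.0, epochs=30, lr=5e-5):
        # Train the transformation network on unlabeled target batches with the MSP-gap loss;
        # the source classifier stays frozen throughout.
        source_clf.eval()
        for p in source_clf.parameters():
            p.requires_grad_(False)

        optimizer = torch.optim.Adam(transform_net.parameters(), lr=lr)
        for _ in range(epochs):
            for x in target_loader:                  # unlabeled target images
                loss = adaptation_loss(x, transform_net, transform, source_clf, tau)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return transform_net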

6.4.2 Source Classifier Training

We train each dataset's source classifier on the respective training set, holding out a portion of it for validation. Both source classifiers are trained until convergence and are optimized using the Adam optimizer with learning rate 1e-3. For the CIFAR-10 experiments, the source classifier is a ResNet-18 model. For the MNIST experiments, the source classifier is a CNN with the following architecture:

  1. Convolutional layer.

  2. Convolutional layer.

  3. Linear layer.

  4. Linear layer.

  5. Linear layer with output size equal to the number of classes.

When training on MNIST, no data augmentation is applied. When training on CIFAR-10, standard data augmentation is applied (horizontal flips, random crops).

6.4.3 Oracle Model Training

The oracle model is trained using the same procedure as the source classifier, except that the oracle model is trained on the target domain. As described in Section 4.1, the training data for the oracle model is generated by applying forward transforms to the respective training dataset.