This is the repo for the paper "Episodic Training for Domain Generalization" https://arxiv.org/abs/1902.00113
Domain generalization (DG) is the challenging and topical problem of learning models that generalize to a novel testing domain with different statistics than a set of known training domains. The simple approach of aggregating data from all source domains and training a single deep neural network end-to-end on all the data provides a surprisingly strong baseline that surpasses many prior published methods. In this paper we build on this strong baseline by designing an episodic training procedure that trains a single deep network in a way that exposes it to the domain shift that characterises a novel domain at runtime. Specifically, we decompose a deep network into feature extractor and classifier components, and then train each component by simulating it interacting with a partner who is badly tuned for the current domain. This makes both components more robust, ultimately leading to our networks producing state-of-the-art performance on three DG benchmarks. As a demonstration, we consider the pervasive workflow of using an ImageNet trained CNN as a fixed feature extractor for downstream recognition tasks. Using the Visual Decathlon benchmark, we demonstrate that our episodic-DG training improves the performance of such a general purpose feature extractor by explicitly training it for robustness to novel problems. This provides the largest-scale demonstration of heterogeneous DG to date.
Machine learning methods often degrade rapidly in performance if they are applied to domains with very different statistics to the data used to train them. This is the problem of domain shift, which domain adaptation (DA) aims to address in the case where some labelled or unlabelled data from the target domain is available for adaptation [1, 37, 21, 10, 22, 3]; and which domain generalisation (DG) aims to address in the case where no adaptation to the target problem is possible [27, 12, 18, 34] due to lack of data or computation. DG is a particularly challenging problem setting, since explicit training on the target is disallowed; yet it is particularly valuable due to its lack of assumptions. For example, it would be valuable to have a domain-general visual feature extractor that performs well ‘out of the box’ as a representation for any novel problem, without the need to fine-tune on the target problem.
The significance of the DG challenge has led to many studies in the literature. These span robust feature space learning [27, 12], model architectures that are purpose designed to enable robustness to domain shift [16, 40, 17] and specially designed learning algorithms for optimising standard architectures [34, 18] that aim to fit them to a more robust minimum. Among all these efforts, it turns out that the naive approach of aggregating all the training domains’ data together and training a single deep network end-to-end is very competitive with state of the art, and better than many published methods, while simultaneously being much simpler and faster than more elaborate alternatives. In this paper we aim to build on the strength and simplicity of this data aggregation strategy, but improve it with an episodic training scheme designed for DG.
The paradigm of episodic training has recently been popularised in the area of few-shot learning [9, 29, 35]. In this problem, the goal is to use a large amount of background source data to train a model that is capable of few-shot learning when adapting to a novel target problem. However, despite the data availability, training on all the source data at once would not be reflective of the target few-shot learning condition. So in order to train the model in a way that reflects how it will be tested, multiple few-shot learning training episodes are set up among all the source datasets [9, 29, 35].
How can an episodic training approach be designed for domain generalisation? Our insight is that, from the perspective of any layer in a neural network, being exposed to a novel domain at testing-time is experienced as that layer’s neighbours being badly tuned for the problem at hand. That is, neighbours provide input to the current layer (or expect output from it) that is different to the current layer’s expectation. Therefore to design episodes for DG, we should expose layers to neighbours that are untrained for the current domain. If a layer can be trained to perform well in this situation of badly tuned neighbours, then its robustness to domain-shift has increased.
To realise our episodic training idea, we break networks up into feature extractor and classifier modules, and show that this proposed episodic training scheme leads to more robust modules that together obtain state of the art results on several DG benchmarks. Our approach benefits from supporting end-to-end learning, while being model agnostic (architecture independent), and simple and fast to train; in contrast to most existing DG techniques that rely on non-standard architectures, auxiliary models, or non-standard optimizers. Finally, we also show how our approach supports heterogeneous DG, that is, learning from multiple sources that do not share the same label-space. This turns out to be an important capability in practice, and one which is not shared by most existing methods [16, 17].
Finally, we provide the first demonstration of the practical value of explicit DG training, beyond the simple toy benchmarks that are common in the literature. Specifically, we consider the common practitioner workflow of using an ImageNet pre-trained CNN as a feature extractor for novel tasks. Using the Visual Decathlon benchmark, we demonstrate that DG training on multiple heterogeneous source domains can lead to an improved feature extractor that generalises better when used off-the-shelf for a variety of novel problems.
Multi-Domain Learning (MDL) MDL aims to learn several domains simultaneously using a single model [2, 30, 31, 41]. Depending on the problem, how much data is available per domain, and how similar the domains are, multi-domain learning can improve, or sometimes worsen [2, 30, 31], performance compared to a single model per domain. MDL is related to DG because the typical setting for DG assumes a similar setup, in that multiple source domains are provided. But now the goal is to learn how to extract a domain-agnostic or domain-robust model from all those source domains. The most rigorous benchmark for MDL is the Visual Decathlon (VD). We repurpose this benchmark for DG by training a CNN on a subset of the VD domains, and then evaluating its performance as a feature extractor on an unseen disjoint subset of them. We are the first to demonstrate DG at this scale, and in the heterogeneous label setting required for VD.
Domain Generalization Despite different motivating intuitions, previous DG methods can generally be divided into a few categories. Domain Invariant Features: These aim to learn a domain-invariant feature representation, typically by minimising the discrepancy between all source domains, and assuming that the resulting source-domain invariant feature will work well for the target as well. To this end, one line of work employed maximum mean discrepancy (MMD), while another proposed a multi-domain reconstruction auto-encoder to learn this domain-invariant feature. More recently, MMD constraints were applied within the representation learning of an autoencoder via adversarial training. Hierarchical Models: These learn a hierarchical set of model parameters, so that the model for each domain is parameterised by a combination of a domain-agnostic and a domain-specific parameter [16, 17]. After learning such a hierarchical model structure on the source domains, the domain agnostic parameter can be extracted as the model with the least domain-specific bias, which is most likely to work on a target problem. This intuition has been exploited in both shallow and deep settings. Data Augmentation: A few studies proposed data augmentation strategies to synthesise additional training data to improve the robustness of a model to novel domains. These include a Bayesian network approach, which perturbs input data based on the domain classification signal from an auxiliary domain classifier. Meanwhile, an adversarial data augmentation method has been proposed to synthesize ‘hard’ data for the training model to enhance its generalization ability. Optimisation Algorithms: A final category of approach is to modify a conventional learning algorithm in an attempt to find a more robust minimum during training, for example through meta-learning. Our approach is different to all of these in that it trains a standard deep model, without special data augmentation and with a conventional optimiser. The key idea requires only a simple modification of the training procedure to introduce appropriately constructed episodes. Finally, we are the first to demonstrate the impact of DG model training in large scale benchmarks such as VD.
Neural Network Meta-Learning Learning-to-learn and meta-learning methods have resurged recently, in particular in few-shot recognition [9, 35, 25] and learning-to-optimize tasks. Despite significant other differences in motivation and methodological formalisations, a common feature of these methods is an episodic training strategy. In the case of few-shot learning, the intuition is that while a lot of source tasks and data may be available, these should be used for training in a way that closely simulates the testing conditions. Therefore at each learning iteration, a random subset of source tasks and instances is sampled to generate a training episode defined by a random few-shot learning task of similar data volume and cardinality as the model is expected to be tested on at runtime. Thus the model eventually ‘sees’ all the training data in aggregate, but in any given iteration, it is evaluated in a condition similar to a real ‘testing’ condition. In this paper we aim to develop an episodic training strategy to improve domain-robustness, rather than learning-to-learn. While the high-level idea of an episodic strategy is the same, the DG problem and associated episode construction details are completely different.
In this section we will first introduce the basic dataset aggregation method (AGG) which provides a strong baseline for DG performance, and then subsequently present three episodic training strategies for training it more robustly.
Problem Setting In the DG setting, we assume that we are given $n$ source domains $\mathcal{D} = [\mathcal{D}_1, \dots, \mathcal{D}_n]$, where $\mathcal{D}_i$ is the $i$-th source domain containing data-label pairs $(x_i^j, y_i^j)$ (here $i$ indicates the domain index and $j$ indicates the instance number within the domain; for simplicity, we will omit $j$ in the following). The goal is to use these to learn a model $f: x \to y$ that generalises well to a novel testing domain $\mathcal{D}^*$ with different statistics to the training domains, without assuming any knowledge of the testing domain during model learning.
For homogeneous DG, we assume that all the source domains and the target domain share the same label space, $\mathcal{Y}_i = \mathcal{Y}_j = \mathcal{Y}^*$, $\forall i, j \in [1, n]$. For the more challenging heterogeneous setting, the domains can have different, potentially completely disjoint label spaces, $\mathcal{Y}_i \neq \mathcal{Y}_j \neq \mathcal{Y}^*$. We will start by introducing the homogeneous case and discuss the heterogeneous case later.
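The distinction between the two settings can be made concrete with a small check on label spaces. This is only an illustrative sketch: the function name `dg_setting` and the example label sets are hypothetical and do not come from the paper.

```python
def dg_setting(source_label_spaces, target_label_space):
    """Return 'homogeneous' if every domain shares one label space, else 'heterogeneous'."""
    # Hypothetical helper: each label space is given as an iterable of class names.
    spaces = [frozenset(s) for s in source_label_spaces]
    spaces.append(frozenset(target_label_space))
    return "homogeneous" if len(set(spaces)) == 1 else "heterogeneous"


# Same classes in every domain (e.g. a PACS-like benchmark).
print(dg_setting([{"dog", "horse"}, {"horse", "dog"}], {"dog", "horse"}))
# Disjoint classes across domains (e.g. a Visual-Decathlon-like benchmark).
print(dg_setting([{"digit_0", "digit_1"}, {"stop_sign"}], {"flower"}))
```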
Architecture Framework For our episode design, we will break a neural network classifier $f$ into a sequence of modules. We use two: a feature extractor $\theta(\cdot)$ and a classifier $\psi(\cdot)$, so that $f(x) = \psi(\theta(x))$.
Vanilla Aggregation Method A simple approach to the DG problem is to aggregate all the source domains’ data together, and train a single CNN end-to-end ignoring the domain label information entirely. Explicitly identified in prior work, this approach is fast, efficient, easy and competitive with more elaborate state of the art alternatives. In terms of neural network modules, it means that both the classifier $\psi$ and the feature extractor $\theta$ are shared across all domains (at least in the homogeneous case), as illustrated in Fig. 1, leading to the objective function:

$$\arg\min_{\theta, \psi} \; \mathbb{E}_{\mathcal{D}_i \sim \mathcal{D}} \Big[ \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \big[ \ell(y_i, \psi(\theta(x_i))) \big] \Big] \qquad (1)$$

where $\mathcal{D}_i$ is the $i$-th source domain in the set of source domains $\mathcal{D}$, and $\ell(\cdot)$ is the cross-entropy loss here.
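As a rough illustration of the aggregation objective, the sketch below averages a softmax cross-entropy loss first within each source domain and then across domains, using one shared model that never sees the domain label. The names `cross_entropy` and `agg_loss`, and the representation of a model as a plain Python callable returning logits, are assumptions made for this example only.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example; `logits` is a list of floats."""
    top = max(logits)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_z - logits[label]


def agg_loss(domains, model):
    """Mean over domains of the mean per-example loss of one shared model.

    `domains` is a list of domains, each a list of (x, y) pairs; `model`
    maps an input x to a list of logits and is not told the domain index.
    """
    per_domain = []
    for data in domains:
        losses = [cross_entropy(model(x), y) for x, y in data]
        per_domain.append(sum(losses) / len(losses))
    return sum(per_domain) / len(per_domain)


# A model that outputs uniform logits over 3 classes has loss log(3).
toy_domains = [[(0.0, 0), (1.0, 2)], [(2.0, 1)]]
print(agg_loss(toy_domains, lambda x: [0.0, 0.0, 0.0]))
```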
Domain Specific Models
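Our goal is to improve robustness by exposing individual modules to partners that are badly calibrated to a given domain. To obtain these ‘badly calibrated’ components, we also train domain-specific models. As illustrated in Fig. 2, this means that each domain $i$ has its own model composed of feature extractor $\theta_i$ and classifier $\psi_i$. Each domain-specific module is only exposed to data of that corresponding domain. To train domain-specific models, we optimise:

$$\arg\min_{[\theta_1, \dots, \theta_n], [\psi_1, \dots, \psi_n]} \; \mathbb{E}_{\mathcal{D}_i \sim \mathcal{D}} \Big[ \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \big[ \ell(y_i, \psi_i(\theta_i(x_i))) \big] \Big] \qquad (2)$$

Here $\ell$ is the cross-entropy loss and $\mathcal{D}_1, \dots, \mathcal{D}_n$ are the $n$ source domains.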
Episodic Training Our goal is to train a domain agnostic model, as per the classifier $\psi$ and feature extractor $\theta$ in the aggregation method in Eq. 1. We will design an episodic scheme that makes use of the domain-specific modules as per Eq. 2 to help the domain-agnostic model achieve the desired robustness. Specifically, we will generate episodes where each domain agnostic module $\psi$ and $\theta$ is paired with a domain-specific partner that is mismatched with the current data being input. This gives module and data combinations of the form $(\psi, \theta_i, x_j)$ and $(\psi_i, \theta, x_j)$ where $i \neq j$, with $\theta_i$ and $\psi_i$ denoting the modules specific to domain $i$ and $x_j$ denoting data from domain $j$.
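One way to realise the mismatch requirement is to sample a data domain and a different partner domain at each iteration. The snippet below is a minimal sketch of such sampling; the function name `sample_mismatched_pair` and the uniform sampling scheme are assumptions, since the text only requires that the two indices differ.

```python
import random


def sample_mismatched_pair(n_domains, rng=random):
    """Sample (i, j) with i != j, uniformly over all ordered pairs.

    i indexes the domain that supplies the data, j indexes the domain whose
    specific module acts as the badly tuned partner.
    """
    i = rng.randrange(n_domains)
    j = rng.randrange(n_domains - 1)  # draw from the n - 1 remaining indices
    if j >= i:
        j += 1                        # skip over i so that j != i
    return i, j


rng = random.Random(0)
print([sample_mismatched_pair(3, rng) for _ in range(5)])
```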
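To train a robust feature extractor $\theta$, we ask it to learn features which are robust enough that data from domain $i$ can be processed by a classifier $\psi_j$ that has never experienced domain $i$ before, as shown in Fig. 3. To generate episodes according to this criterion, we optimise:

$$\arg\min_{\theta} \; \mathbb{E}_{i, j \sim [1, n],\, i \neq j} \Big[ \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \big[ \ell(y_i, \bar{\psi}_j(\theta(x_i))) \big] \Big] \qquad (3)$$

where $i \neq j$, $\ell$ is the cross-entropy loss, and $\bar{\psi}_j$ means that $\psi_j$ is considered constant for the generation of this loss, i.e., it does not receive back-propagated gradients. This gradient-blocking is important, because without it the data from domain $i$ would ‘pollute’ the classifier $\psi_j$, which we want to retain as being naive to domains outside of $j$.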
Thus in this optimisation, only the feature extractor is penalized whenever the classifier makes the wrong prediction. That means that, for this loss to be minimised, the shared feature extractor must map data into a format that a ‘naive’ classifier can correctly classify. The feature extractor must learn to help a classifier recognize a data point that is from a domain that is novel to that classifier.
Analogous to the above, we can also interpret DG as the requirement that a classifier should be robust enough to classify data even if it is encoded by a feature extractor which has never seen this type of data in the past, as illustrated in Fig. 3. Thus to train the robust classifier $\psi$ we ask it to classify domain $i$ instances $x_i$ fed through a domain $j$-specific feature extractor $\theta_j$. To generate episodes according to this criterion, we optimise:

$$\arg\min_{\psi} \; \mathbb{E}_{i, j \sim [1, n],\, i \neq j} \Big[ \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \big[ \ell(y_i, \psi(\bar{\theta}_j(x_i))) \big] \Big] \qquad (4)$$

where $i \neq j$, $\ell$ is the cross-entropy loss, and $\bar{\theta}_j$ means $\theta_j$ is considered constant for generation of the loss here. Similarly to the training of the feature extractor module, this operation is important to retain the domain-specificity of feature extractor $\theta_j$. The result is that only the classifier $\psi$ is penalised, and in order to minimise this loss $\psi$ must be robust enough to accept data $x_i$ that has been encoded by a naive feature extractor $\theta_j$.
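The two gradient-blocked updates can be sketched with tiny linear modules and hand-written gradients. Everything here is illustrative: the linear feature extractor and classifier, the helper names, and the learning rate are assumptions, and a real implementation would rely on an autograd framework's stop-gradient mechanism rather than manual derivatives. The point of the sketch is only which weights change in each episode.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grads(theta, psi, x, y):
    """Cross-entropy of psi(theta(x)) and its gradients w.r.t. both matrices."""
    h = matvec(theta, x)                      # features
    p = softmax(matvec(psi, h))               # class probabilities
    err = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
    grad_psi = [[e * hf for hf in h] for e in err]
    grad_h = [sum(psi[k][f] * err[k] for k in range(len(psi)))
              for f in range(len(h))]
    grad_theta = [[g * xd for xd in x] for g in grad_h]
    return -math.log(p[y]), grad_theta, grad_psi


def sgd_step(weights, grads, lr):
    return [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(weights, grads)]


def episodic_feature_step(theta, psi_j, x_i, y_i, lr=0.1):
    """Update only the shared extractor; the domain-j classifier stays constant."""
    _, grad_theta, _ = loss_and_grads(theta, psi_j, x_i, y_i)
    return sgd_step(theta, grad_theta, lr)    # gradient for psi_j is discarded


def episodic_classifier_step(psi, theta_j, x_i, y_i, lr=0.1):
    """Update only the shared classifier; the domain-j extractor stays constant."""
    _, _, grad_psi = loss_and_grads(theta_j, psi, x_i, y_i)
    return sgd_step(psi, grad_psi, lr)        # gradient for theta_j is discarded
```

In both functions the partner module is read but never written, which mirrors the role of the barred symbols in the episodic losses.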
To make robust predictions, a good feature representation is crucial [23, 6]. The episodic feature training strategy above is suitable for the homogeneous DG setting, since it requires all domains to share the same label-space. Here we further introduce a novel extension that is suitable for both homogeneous and heterogeneous label-spaces.
In Section 3.2, we introduced the notion of regularising a deep feature extractor by requiring it to support a classifier inexperienced with data from the current domain. Taking this to an extreme, we consider asking the feature extractor $\theta$ to support the predictions of a classifier with random weights, as shown in Fig. 4. To this end, our objective function here is:

$$\arg\min_{\theta} \; \mathbb{E}_{\mathcal{D}_i \sim \mathcal{D}} \Big[ \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i} \big[ \ell(y_i, \bar{\psi}_r(\theta(x_i))) \big] \Big] \qquad (5)$$

where $\psi_r$ is a randomly initialised classifier, and $\bar{\psi}_r$ means it is a constant not updated in the optimization. This can be seen as an extreme version of our earlier episodic cross-domain feature extractor training (not only has the classifier not seen any data from domain $i$, but it has not seen any data at all). Moreover, it has the benefit of not requiring a label-space to be shared across all training domains, unlike the previous method in Eq. 3.
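Because the random classifier is never trained, one can be created per source domain with whatever output size that domain needs, which is what makes this variant usable with disjoint label spaces. The sketch below shows the idea under stated assumptions: the function name, the Gaussian initialisation with scale 0.01, the linear form of the classifier and the example class counts are all illustrative choices, not details given in the text.

```python
import random


def make_random_classifiers(feature_dim, classes_per_domain, seed=0, scale=0.01):
    """Build one frozen, randomly initialised linear classifier per domain.

    Each classifier is a (num_classes x feature_dim) weight matrix. The
    matrices are created once and never updated during training.
    """
    rng = random.Random(seed)
    return {
        name: [[rng.gauss(0.0, scale) for _ in range(feature_dim)]
               for _ in range(num_classes)]
        for name, num_classes in classes_per_domain.items()
    }


# Hypothetical domains with different numbers of classes.
frozen = make_random_classifiers(4, {"domain_a": 3, "domain_b": 7})
print({name: (len(w), len(w[0])) for name, w in frozen.items()})
```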
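Our full algorithm brings together the domain agnostic modules that are our goal to improve and the domain-specific modules that will help train them (Section 3.1), and generates episodes according to the three strategies introduced above. Referring to the losses in Eq. 1, 2, 3, 4, 5 as $\mathcal{L}_{agg}$, $\mathcal{L}_{ds}$, $\mathcal{L}_{epif}$, $\mathcal{L}_{epic}$, $\mathcal{L}_{epir}$, then overall we optimise:

$$\arg\min_{\theta, \psi, \theta_1, \dots, \theta_n, \psi_1, \dots, \psi_n} \; \mathcal{L}_{agg} + \mathcal{L}_{ds} + \lambda_1 \mathcal{L}_{epif} + \lambda_2 \mathcal{L}_{epic} + \lambda_3 \mathcal{L}_{epir} \qquad (6)$$

for parameters $\theta$, $\psi$, $\{\theta_i, \psi_i\}_{i=1}^{n}$, where $\lambda_1$, $\lambda_2$, $\lambda_3$ weight the three episodic terms. The full pseudocode for the algorithm is given in Algorithm 1. It is noteworthy that, in practice, when training we first warm up the domain-specific branches for a few iterations before training both the domain-specific and domain-agnostic modules jointly.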
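The way the five loss terms and the warm-up phase fit together can be sketched as a single function. This is an assumption-laden illustration: the dictionary keys, the default weights of 1.0 and the warm-up length are placeholders chosen for the example, and the text does not state how many warm-up iterations are used.

```python
def training_objective(losses, iteration, warmup_iters=100, lambdas=(1.0, 1.0, 1.0)):
    """Combine the loss terms for one iteration.

    `losses` holds already-computed scalar values under the keys
    'agg', 'ds', 'epif', 'epic' and 'epir'. During warm-up only the
    domain-specific branches are trained; afterwards all terms are summed,
    with the three episodic terms weighted by `lambdas`.
    """
    if iteration < warmup_iters:
        return losses["ds"]
    lam_f, lam_c, lam_r = lambdas
    return (losses["agg"] + losses["ds"]
            + lam_f * losses["epif"]
            + lam_c * losses["epic"]
            + lam_r * losses["epir"])


example = {"agg": 1.0, "ds": 2.0, "epif": 3.0, "epic": 4.0, "epir": 5.0}
print(training_objective(example, iteration=0))    # warm-up phase
print(training_objective(example, iteration=500))  # joint training phase
```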
Datasets We evaluate our algorithm on three different homogeneous DG benchmarks and introduce a novel and larger scale heterogeneous DG benchmark. The datasets are: IXMAS, a cross-view action recognition task. Two object recognition benchmarks: VLCS, which includes images from four famous datasets, PASCAL VOC2007 (V), LabelMe (L), Caltech (C) and SUN09 (S); and PACS, which was recently released and shown to have a larger between-domain gap than VLCS. PACS contains four domains covering Photo (P), Art Painting (A), Cartoon (C) and Sketch (S) images. VD: For the final benchmark we repurpose the Visual Decathlon benchmark to evaluate DG.
Competitors We evaluate the following competitors: AGG, the vanilla aggregation method introduced in Eq. 1, trains a single model for all source domains. DICA, a kernel-based method for learning domain invariant feature representations. LRE-SVM, an SVM-based method that trains a different SVM model for each source domain; for a test domain, it uses the SVM model from the most similar source domain. D-MTAE, a de-noising multi-task auto encoder method, which learns domain invariant features by cross-domain reconstruction. DSN, Domain Separation Networks, decompose the source domains into shared and private spaces and learn them with a reconstruction signal. TF-CNN learns a domain-agnostic model by factoring out the common component from a set of domain-specific models, and uses tensor factorization to compress the model parameters. CCSA uses semantic alignment to regularize the learned feature subspace. DANN, Domain Adversarial Neural Networks, train a feature extractor with a domain-adversarial loss among the source domains; the source-domain invariant feature extractor is assumed to generalise better to novel target domains. MLDG, a recent meta-learning based optimization method, mimics the DG setting by splitting source domains into meta-train and meta-test, and modifies the optimisation to improve meta-test performance. Fusion, a method that fuses the predictions from source domain classifiers for the target domain. MMD-AAE, a recent method that learns domain invariant feature autoencoding with adversarial training while ensuring that the domains are aligned using an MMD constraint. Reptile, a recently proposed state of the art shortest descent-based meta-learner.
We note that DANN (domain adaptation) and Reptile (few-shot learning) are not designed for DG. However, DANN learns domain invariant features, which is natural for DG, and we found it effective for this problem. Reptile learns to maximize the inner-product of gradient updates between different batches within a task. It is related to MLDG, which maximizes the inner-product of gradient updates between source domains. Therefore we also repurpose it as a baseline. Details can be found in Appendix A.
We call our method Episodic. We use Epi-FCR to denote our full method with (f)eature regularisation, (c)lassifier regularisation and (r)andom classifier regularisation respectively. Ablated variants such as Epi-F denote feature regularisation alone, etc. Our method is implemented using PyTorch.
Table 1 (IXMAS). Columns: Source | Target | DICA | LRE-SVM | D-MTAE | CCSA | MMD-AAE | DANN | MLDG | Reptile | AGG | Epi-FCR
Table 2 (VLCS). Columns: Source | Target | DICA | LRE-SVM | D-MTAE | CCSA | MMD-AAE | DANN | MLDG | Reptile | AGG | Epi-FCR
Setting The dataset contains 11 different human actions. All actions were video recorded by 5 cameras with different views (referred to as 0,…,4). The goal is to train an action recognition model on a set of source views (domains), and recognise the action from a novel target view (domain). Following prior work, we keep the first 5 actions and use the same Dense trajectory features as input. For our method, we also follow prior work in using a one-hidden-layer network with 2000 hidden neurons as our backbone, and report the average result of 20 runs. The optimizer is M-SGD with learning rate 1e-4, momentum 0.9, weight decay 5e-5.
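For readers unfamiliar with the optimizer settings quoted above, the sketch below shows one update of SGD with momentum and weight decay for a single scalar weight. It assumes that M-SGD refers to momentum SGD and uses the common convention in which weight decay is added to the gradient before the momentum buffer is updated (the convention of PyTorch's `torch.optim.SGD` with default dampening); the function name `msgd_step` is hypothetical.

```python
def msgd_step(w, grad, velocity, lr=1e-4, momentum=0.9, weight_decay=5e-5):
    """One momentum-SGD update for a single scalar weight."""
    grad = grad + weight_decay * w         # L2 weight decay folded into the gradient
    velocity = momentum * velocity + grad  # update the momentum buffer
    return w - lr * velocity, velocity


w, v = 1.0, 0.0
for _ in range(3):
    w, v = msgd_step(w, grad=0.5, velocity=v)
print(w, v)
```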
Results From the results in Table 1, we can see that: (i) The vanilla aggregation method, AGG, is a strong competitor compared to several prior published methods, as are DANN and Reptile, which are newly identified by us as effective DG algorithms. (ii) Overall our Epi-FCR model performs best, improving 2.4% on AGG, and 1.1% on the prior state of the art MMD-AAE. (iii) In views 1 and 2 in particular, our method achieves new state-of-the-art performance.
Setting VLCS domains share 5 categories: bird, car, chair, dog and person. We use pre-extracted DeCAF6 features and follow prior work to randomly split each domain into train (70%) and test (30%) and do leave-one-out evaluation. Also following prior work, we use an architecture of 2 fully connected layers with output sizes of 1024 and 128 and ReLU activation, and report the average performance of 20 trials. The optimizer is M-SGD with learning rate 1e-3, momentum 0.9 and weight decay 5e-5.
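The leave-one-out evaluation over domains can be written as a short generator: each domain is held out once as the target while the remaining ones act as sources. The function name is an assumption for this sketch, and the 70/30 train/test split within each domain is not shown.

```python
def leave_one_domain_out(domains):
    """Yield (source_domains, target_domain), holding out each domain once."""
    for target in domains:
        sources = [d for d in domains if d != target]
        yield sources, target


for sources, target in leave_one_domain_out(["V", "L", "C", "S"]):
    print("train on", sources, "-> test on", target)
```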
Results From the results in Table 2, we can see that: (i) The simple AGG baseline is again competitive with many published alternatives, as are DANN and Reptile. (ii) Our Epi-FCR method achieves the best performance, improving on AGG by 1.7% and performing comparably to prior state of the art MMD-AAE and MLDG with 0.6% improvement.
Setting PACS is a recently released dataset with different object style depictions, and a more challenging domain shift than VLCS, as shown in prior work. This dataset shares 7 object categories across domains, including ‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘house’, ‘horse’ and ‘person’. We follow the established protocol, including the recommended train and validation split, for a fair comparison. We first follow prior work in using the ImageNet pre-trained AlexNet (in Table 3) and subsequently also use a modern ImageNet pre-trained ResNet-18 (in Table 4) as a base CNN architecture. We train our network using the M-SGD optimizer (batch size per domain=32, lr=1e-3, momentum=0.9, weight decay=5e-5) for 45k iterations when using AlexNet, and use the same optimizer (with weight decay=1e-4) for ResNet-18.
Table 3 (PACS, AlexNet). Columns: Source | Target | DICA | LRE-SVM | D-MTAE | DSN | TF-CNN | MLDG | Fusion | DANN | AGG | Epi-FCR
Results From the AlexNet results in Table 3, we can see that: (i) Our episodic method obtained the best performance on the two held out domains C and S, and comparable performance on the A and P domains. (ii) It also achieves the best performance overall, with a 3.3% improvement on vanilla AGG, and a 1.7% improvement on the prior state of the art Fusion.
Meanwhile in Table 4, we see that with a modern ResNet-18 architecture, the basic results are improved across the board as expected. However: (i) Our newly identified DANN and Reptile baselines manage to improve on the vanilla AGG, further demonstrating their effectiveness on DG. (ii) Our full episodic method maintains the best performance overall, with a 2.4% improvement on AGG.
For DG tasks we need to be careful with batch normalization. Batch norm accumulates statistics of the training data during training, for use at testing. In DG, the source and target domains have domain-shift between them, so different ways of employing batch norm produce different results. We tried two ways of coping with batch norm: one is directly using frozen pre-trained ImageNet statistics; the other is to unfreeze and accumulate statistics from the source domains. We observed that training ResNet-18 on PACS while accumulating the statistics from source domains produced a worse accuracy than freezing ImageNet statistics.
Table 4 (PACS, ResNet-18). Columns: Source | Target | AGG | DANN | MLDG | Reptile | Epi-FCR
Ablation Study To understand the contribution of each component of our model, we perform an ablation study using PACS-AlexNet. From the results in Table 5, we can see that episodic training for the feature extractor gives a 1.6% boost over the vanilla AGG. Including episodic training of the classifier further improves this by 0.5%. Finally, combining all the episodic training components provides a 3.3% improvement over vanilla AGG. This confirms that each component of our model contributes to final performance.
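Cross-Domain Testing Analysis To understand how our Epi-FCR method obtains its improved robustness to domain shift, we study its impact on cross-domain testing. Recall that when we activate the episodic training of the agnostic feature extractor and classifier, we benefit from the domain specific branches by routing domain data across domain branches. E.g., we feed domain-$i$ data $x_i$ through the agnostic feature extractor $\theta$ into a mismatched domain-specific classifier $\psi_j$ (that is, $x_i \to \theta \to \psi_j$) to train Eq. 3, and through a mismatched domain-specific feature extractor $\theta_j$ into the agnostic classifier $\psi$ (that is, $x_i \to \theta_j \to \psi$) to train Eq. 4, with $i \neq j$.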
Therefore it is natural to evaluate cross-domain testing after training the models. As illustrated in Fig. 5 (to save space we only display the leave-photo-out split; the others are consistent with these observations), we can see that the episodic training strategy indeed improves cross-domain testing performance. For example, when we feed domain A data to the domain C classifier, the Episodic-trained agnostic extractor improves the performance of the domain-C classifier, which has never experienced domain A data (Fig. 5, left); and similarly for the Episodic-trained agnostic classifier.
Analysis of Solution Robustness In the above experiments we confirmed that our episodic model outperforms the strong AGG baseline in a variety of benchmarks, and that each component of our framework contributes. In terms of analysing the mechanism by which episodic training improves robustness to domain shift, one possible route is through leading the model to find a higher quality minimum. Several studies recently have analysed learning algorithm variants in terms of the quality of the minima that they lead a model to [15, 4].
One intuition is that converging to a ‘wide’ rather than ‘sharp’ minimum provides a more robust solution, because perturbations (such as domain shift, in our case) are less likely to cause a big hit to accuracy if the model’s performance is not dependent on a very precisely calibrated solution. Following [15, 42], we therefore compare the solutions found by AGG and our Epi-FCR by adding noise to the weights of the converged model, and observing how quickly the testing accuracy decreases with the magnitude of the noise. From Fig. 6 we can see that both models’ performance drops as weights are perturbed, but our Epi-FCR model is more robust to weight perturbations. This suggests that the minimum found by Epi-FCR is a more robust one than that found by AGG, which may explain the improved cross domain robustness of Epi-FCR compared to AGG.
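The weight-perturbation analysis can be sketched as follows: for each noise scale, add Gaussian noise to the converged weights several times and average the resulting test accuracy. The function name, the use of independent zero-mean Gaussian noise on a flat weight list, the number of trials and the toy evaluator are assumptions for illustration; the text does not specify the exact noise model.

```python
import random


def accuracy_under_noise(weights, evaluate, sigmas, trials=5, seed=0):
    """Mean accuracy of `evaluate(noisy_weights)` for each noise scale.

    `weights` is a flat list of floats and `evaluate` maps a weight list
    to an accuracy in [0, 1].
    """
    rng = random.Random(seed)
    curve = []
    for sigma in sigmas:
        accs = []
        for _ in range(trials):
            noisy = [w + rng.gauss(0.0, sigma) for w in weights]
            accs.append(evaluate(noisy))
        curve.append(sum(accs) / trials)
    return curve


# Toy 1-D classifier: predict class 1 when w * x > 0.
xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
toy_eval = lambda w: sum(int((w[0] * x > 0) == bool(y)) for x, y in zip(xs, ys)) / len(xs)
print(accuracy_under_noise([1.0], toy_eval, sigmas=[0.0, 0.5, 5.0]))
```

A model sitting in a wider minimum would show a curve that decays more slowly as the noise scale grows.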
Table 6 (benchmark comparison). Columns: Benchmark | # of data | # of Domains | # of tasks | task space
Table 7 (Visual Decathlon DG results). Columns: Target | ImageNet PT | AGG | DANN | MLDG | Reptile | Epi-R
Visual Decathlon contains ten datasets and was initially proposed as a multi-domain learning benchmark. We re-purpose Decathlon for the more ambitious challenge of domain generalisation. As explained earlier, our motivation is to find out if DG learning can improve the de facto standard ‘ImageNet trained CNN feature extractor’ for use as an off-the-shelf representation for new target problems. Given the widespread usage of this workflow by vision practitioners, an improvement on a vanilla ImageNet CNN could be of major practical value.
We compare this heterogeneous DG benchmark to the largest existing DG benchmarks by their image, domain, and task numbers in Table 6. VD-DG has twice the domains of VLCS and PACS. It is also an order of magnitude larger in terms of data and task numbers. This demonstrates the challenge and greater practical significance of VD-DG.
Setting For the experiments, we consider a setting where the five largest scale datasets (CIFAR-100, Daimler Ped, GTSRB, Omniglot and SVHN, excluding ImageNet) serve as our source domains, and the four smallest datasets (Aircraft, D. Textures, VGG-Flowers and UCF101) serve as our target domains. We always exploit ImageNet as an initial condition, but do not include it in DG training for computational feasibility. The goal is to use DG training among the five datasets to learn a feature which outperforms the off-the-shelf ImageNet-trained CNN that we use as an initial condition. We use ResNet-18 as the backbone model, and resize all the images to 64x64 for computational efficiency. To support the VD heterogeneous label space, we assume a shared feature extractor, and a source domain-specific classifier. We perform episodic DG training among the source domains, using our (R)andom classifier model variant, which supports heterogeneous label-spaces. After DG training, the model will then be used as a fixed feature extractor for the held out target domain. Its features are combined (by concatenation and mean-pooling) with the original ImageNet pre-trained features. Since ImageNet is excluded from the source domains for computational feasibility, there is a loss of performance for all models compared to the original feature due to the forgetting effect. This final feature is used to train a linear SVM for the corresponding task, as per common practice. We train the network using the M-SGD optimizer (batch size per domain=32, lr=1e-3, momentum=0.9, weight decay=1e-4) for 100k iterations, where the lr is decayed by a factor of 10 at 40k and 80k iterations.
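The step-wise learning rate schedule described above can be sketched as a small function. The function name and the choice to apply each decay starting exactly at the milestone iteration are assumptions; the text only states the milestones and the decay factor.

```python
def learning_rate(iteration, base_lr=1e-3, milestones=(40_000, 80_000), factor=10.0):
    """Base learning rate divided by `factor` once per milestone already reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr / (factor ** passed)


for it in (0, 39_999, 40_000, 80_000, 99_999):
    print(it, learning_rate(it))
```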
Results From the results in Table 7, we observed that: (i) We do learn a feature that is more robust to novel domains compared to the standard ImageNet pre-trained features (Epi-R vs ImageNet PT improves 7.1% or 2.7% on held-out datasets). (ii) Moreover, while the vanilla AGG baseline also improves the ImageNet features, our Epi-R provides a clear improvement on AGG. (iii) In terms of other DG competitors: we note that besides MLDG, the only other competitors that we were able to feasibly run on this large scale benchmark were DANN and Reptile, methods that we first identified as re-purposeable for DG in this paper. Other methods either do not support heterogeneous label-spaces, or do not scale to this many domains or this many examples. (iv) Overall our Epi-R method outperforms all alternatives in both average accuracy and the VD score, which the benchmark recommends in preference to accuracy. Overall this is the first demonstration that any DG method can improve robustness to domain shift in a larger-scale setting, across heterogeneous domains, and make a practical impact in surpassing ImageNet feature performance.
In this paper, we addressed the domain generalisation problem. We proposed a simple episodic training strategy that mimics the train-test domain-shift experienced in a DG scenario during training. We showed that our method achieves state of the art performance on all the main existing DG benchmarks. More importantly, we provided the first demonstration of DG’s potential value ‘in the wild’, by demonstrating our model’s potential to improve the de facto standard ImageNet pre-trained CNN feature extractor by performing heterogeneous DG at the largest scale to date using the Visual Decathlon benchmark.
Unsupervised domain adaptation by backpropagation. In ICML, 2015.
On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
Rethinking the inception architecture for computer vision. In CVPR, 2016.
In this appendix we explain how the Reptile algorithm, originally designed for few-shot learning, can be adapted to the DG problem setting.
We first consider homogeneous DG. The re-purposed Reptile-DG is shown in Algorithm 2. Given the $n$ source domains $\mathcal{D}$, in each iteration we copy the training parameters $\Theta$ as $\tilde{\Theta}_0$. Then we randomly sample a mini-batch from one domain and do a one-step inner SGD update on $\tilde{\Theta}$. Once we have sampled the mini-batches from all source domains, we do a one-step outer update on $\Theta$ to update it towards the final inner parameters $\tilde{\Theta}_n$.
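One iteration of the procedure just described can be sketched in plain Python. This is an illustrative reading, not the authors' Algorithm 2: the function name, the representation of each domain by a gradient function standing in for a sampled mini-batch, and the step sizes are assumptions made for the example.

```python
import random


def reptile_dg_step(params, domain_grads, inner_lr=0.1, outer_lr=1.0, rng=random):
    """One Reptile-DG iteration on a parameter vector (list of floats).

    `domain_grads` holds one gradient function per source domain, each
    standing in for the loss gradient on a mini-batch from that domain.
    """
    fast = list(params)                    # copy of the training parameters
    order = list(range(len(domain_grads)))
    rng.shuffle(order)                     # visit the source domains in random order
    for d in order:
        g = domain_grads[d](fast)
        fast = [w - inner_lr * gw for w, gw in zip(fast, g)]  # one inner SGD step
    # Outer step: move the training parameters towards the adapted copy.
    return [w + outer_lr * (f - w) for w, f in zip(params, fast)]


# Two toy domains with quadratic losses centred at 1.0 and 3.0.
grads = [lambda p: [p[0] - 1.0], lambda p: [p[0] - 3.0]]
theta = [0.0]
for _ in range(200):
    theta = reptile_dg_step(theta, grads, rng=random.Random(0))
print(theta)
```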
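How does Reptile-DG work? If we refer to the loss and parameters of inner step $i$ as $\mathcal{L}_i$ and $\tilde{\Theta}_i$, with $\tilde{\Theta}_0$ the copy of the training parameters $\Theta$ and $\alpha$ the inner step size, the inner gradient of that step is $g_i = \mathcal{L}_i'(\tilde{\Theta}_{i-1})$. And we can get $\tilde{\Theta}_i = \tilde{\Theta}_{i-1} - \alpha g_i$. Then we take the Taylor series of $g_i$ at the initial point $\tilde{\Theta}_0$, and get the equation

$$g_i = \mathcal{L}_i'(\tilde{\Theta}_0) + \mathcal{L}_i''(\tilde{\Theta}_0)\,(\tilde{\Theta}_{i-1} - \tilde{\Theta}_0) + O(\alpha^2) \qquad (7)$$

where the $O(\alpha^2)$ items are omitted due to their small values. If we treat the gradient and Hessian of $\mathcal{L}_i$ w.r.t. $\tilde{\Theta}_0$ as $\bar{g}_i$ and $\bar{H}_i$, we would have

$$g_i = \bar{g}_i + \bar{H}_i\,(\tilde{\Theta}_{i-1} - \tilde{\Theta}_0) + O(\alpha^2) \qquad (8)$$

Equivalently, we get $\tilde{\Theta}_{i-1} - \tilde{\Theta}_0 = -\alpha \sum_{j=1}^{i-1} g_j$. Then together with $g_j = \bar{g}_j + O(\alpha)$, Eq. 7 becomes

$$g_i = \bar{g}_i - \alpha \bar{H}_i \sum_{j=1}^{i-1} \bar{g}_j + O(\alpha^2) \qquad (9)$$

Now consider an example with two source domains. The gradient update of Reptile-DG is

$$g_{\text{Reptile-DG}} = \frac{\Theta - \tilde{\Theta}_2}{\alpha} = g_1 + g_2 \qquad (10)$$

And when we bring Eq. 9 in, we get

$$g_{\text{Reptile-DG}} = \bar{g}_1 + \bar{g}_2 - \alpha \bar{H}_2 \bar{g}_1 + O(\alpha^2) \qquad (11)$$

As mentioned in Reptile, if we take the expectation over the two inner-step losses $\mathcal{L}_1, \mathcal{L}_2$, we get

$$\mathbb{E}[\bar{H}_2 \bar{g}_1] = \mathbb{E}[\bar{H}_1 \bar{g}_2] = \frac{1}{2}\,\mathbb{E}[\bar{H}_2 \bar{g}_1 + \bar{H}_1 \bar{g}_2] = \frac{1}{2}\,\mathbb{E}\Big[\frac{\partial}{\partial \tilde{\Theta}_0}(\bar{g}_1 \cdot \bar{g}_2)\Big] \qquad (12)$$

Thus, because the Reptile-DG gradient contains the term $-\alpha \bar{H}_2 \bar{g}_1$, a descent step moves the parameters in the direction that increases the inner product between gradients of mini-batches from different source domains, which is similar to what MLDG does.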
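The two-domain case can be checked numerically. For quadratic losses the Hessian is constant, so the higher-order terms vanish and the sum of the two inner gradients should match the first-order expansion (the two start-point gradients minus the step size times the second Hessian times the first gradient) exactly. The one-dimensional losses and the function name below are assumptions chosen to keep the check small.

```python
def reptile_two_domain_gradient(theta0, a1, c1, a2, c2, alpha):
    """Compare (theta0 - theta2) / alpha with its first-order expansion.

    Uses two 1-D quadratic losses L_i(t) = 0.5 * a_i * (t - c_i) ** 2,
    whose gradient is a_i * (t - c_i) and whose Hessian is a_i.
    """
    g1 = a1 * (theta0 - c1)            # inner gradient of step 1
    theta1 = theta0 - alpha * g1
    g2 = a2 * (theta1 - c2)            # inner gradient of step 2, at the updated point
    theta2 = theta1 - alpha * g2
    exact = (theta0 - theta2) / alpha  # equals g1 + g2

    g1_bar = a1 * (theta0 - c1)        # gradients and Hessian at the start point
    g2_bar = a2 * (theta0 - c2)
    h2_bar = a2
    expansion = g1_bar + g2_bar - alpha * h2_bar * g1_bar
    return exact, expansion


print(reptile_two_domain_gradient(1.0, a1=2.0, c1=0.0, a2=3.0, c2=2.0, alpha=0.1))
```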
Benefit of Reptile-DG Unlike MLDG, in each iteration Reptile-DG just samples the mini-batches from all source domains in a random order without explicitly splitting the source domains into meta-train and meta-test sets. In addition, optimization by shortest path descent does not require second-order gradients, unlike the meta-optimization in MLDG.
As analyzed in previous section, Reptile-DG learns to maximize the inner product of gradients of different mini-batches. It assumes that all source domains share the entire model.
In heterogeneous DG, the source domains have different task spaces and only the feature extractor module is shared. Thus, to apply Reptile-DG to the heterogeneous setting, we apply the optimization to the feature extractor alone, as shown in Algorithm 3. By maximizing the inner product of gradients w.r.t. the feature extractor between different mini-batches, it improves the generalization of the feature representation. If the feature extractor and the classifiers are not flexibly decomposable, a practical way of implementing Algorithm 3 is to directly do the shortest descent optimization on the entire model, similar to Algorithm 2, i.e. the optimized parameters include the feature extractor and all the domain-specific classifiers. But, because the source domains only share the feature extractor, the GradInner item (the gradient inner-product term) still only applies to the feature extractor parameters.