Episodic Training for Domain Generalization

01/31/2019 ∙ by Da Li, et al. ∙ Queen Mary University of London ∙ Anhui USTC iFLYTEK Co.

Domain generalization (DG) is the challenging and topical problem of learning models that generalize to novel testing domains with different statistics from a set of known training domains. The simple approach of aggregating data from all source domains and training a single deep neural network end-to-end on all the data provides a surprisingly strong baseline that surpasses many prior published methods. In this paper we build on this strong baseline by designing an episodic training procedure that trains a single deep network in a way that exposes it to the domain shift that characterises a novel domain at runtime. Specifically, we decompose a deep network into feature extractor and classifier components, and then train each component by simulating it interacting with a partner that is badly tuned for the current domain. This makes both components more robust, ultimately leading to our networks producing state-of-the-art performance on three DG benchmarks. As a demonstration, we consider the pervasive workflow of using an ImageNet-trained CNN as a fixed feature extractor for downstream recognition tasks. Using the Visual Decathlon benchmark, we demonstrate that our episodic-DG training improves the performance of such a general-purpose feature extractor by explicitly training it for robustness to novel problems. This provides the largest-scale demonstration of heterogeneous DG to date.



Code Repositories


This is the repo for the paper "Episodic Training for Domain Generalization" https://arxiv.org/abs/1902.00113


1 Introduction

Machine learning methods often degrade rapidly in performance if they are applied to domains with very different statistics to the data used to train them. This is the problem of domain shift, which domain adaptation (DA) aims to address in the case where some labelled or unlabelled data from the target domain is available for adaptation [1, 37, 21, 10, 22, 3]; and domain generalisation (DG) aims to address in the case where no adaptation to the target problem is possible [27, 12, 18, 34] due to lack of data or computation. DG is a particularly challenging problem setting, since explicit training on the target is disallowed; yet it is particularly valuable due to its lack of assumptions. For example, it would be valuable to have a domain-general visual feature extractor that performs well ‘out of the box’ as a representation for any novel problem, without the need to fine-tune on the target problem.

The significance of the DG challenge has led to many studies in the literature. These span robust feature space learning [27, 12], model architectures that are purpose-designed to enable robustness to domain shift [16, 40, 17], and specially designed learning algorithms for optimising standard architectures [34, 18] that aim to fit them to more robust minima. Among all these efforts, it turns out that the naive approach [17] of aggregating all the training domains’ data together and training a single deep network end-to-end is very competitive with the state of the art, and better than many published methods – while simultaneously being much simpler and faster than more elaborate alternatives. In this paper we aim to build on the strength and simplicity of this simple data aggregation strategy, but improve it by designing an episodic training scheme designed for DG.

The paradigm of episodic training has recently been popularised in the area of few-shot learning [9, 29, 35]. In this problem, the goal is to use a large amount of background source data to train a model that is capable of few-shot learning when adapting to a novel target problem. However, despite the data availability, training on all the source data at once would not be reflective of the target few-shot learning condition. So, in order to train the model in a way that reflects how it will be tested, multiple few-shot learning training episodes are set up among all the source datasets [9, 29, 35].

How can an episodic training approach be designed for domain generalisation? Our insight is that, from the perspective of any layer in a neural network, being exposed to a novel domain at testing time is experienced as that layer’s neighbours being badly tuned for the problem at hand. That is, neighbours provide input to the current layer (or expect output from it) that is different to the current layer’s expectation. Therefore, to design episodes for DG, we should expose layers to neighbours that are untrained for the current domain. If a layer can be trained to perform well in this situation of badly tuned neighbours, then its robustness to domain shift has increased.

To realise our episodic training idea, we break networks up into feature extractor and classifier modules, and show that this proposed episodic training scheme leads to more robust modules that together obtain state of the art results on several DG benchmarks. Our approach benefits from supporting end-to-end learning, while being model agnostic (architecture independent), and simple and fast to train; in contrast to most existing DG techniques that rely on non-standard architectures [17], auxiliary models [34], or non-standard optimizers [18]. Finally, we also show how our approach supports heterogeneous DG – learning from multiple sources that do not share the same label-space. This turns out to be an important capability in practice, and one which is not shared by most existing methods [16, 17].

Finally, we provide the first demonstration of the practical value of explicit DG training, beyond the simple toy benchmarks that are common in the literature. Specifically, we consider the common practitioner workflow of using an ImageNet [32] pre-trained CNN as a feature extractor for novel tasks. Using the Visual Decathlon benchmark [30] we demonstrate that DG training on multiple heterogeneous source domains can lead to an improved feature extractor that generalises better when used off-the-shelf for a variety of novel problems.

2 Related Work

Multi-Domain Learning (MDL)  MDL aims to learn several domains simultaneously using a single model [2, 30, 31, 41]. Depending on the problem, how much data is available per domain, and how similar the domains are, multi-domain learning can improve [41] – or sometimes worsen [2, 30, 31] – performance compared to a single model per domain. MDL is related to DG because the typical DG setting assumes a similar setup, in that multiple source domains are provided; but now the goal is to learn how to extract a domain-agnostic or domain-robust model from all those source domains. The most rigorous benchmark for MDL is the Visual Decathlon (VD) [30]. We repurpose this benchmark for DG by training a CNN on a subset of the VD domains, and then evaluating its performance as a feature extractor on an unseen disjoint subset of them. We are the first to demonstrate DG at this scale, and in the heterogeneous label setting required for VD.

Domain Generalization  Despite different motivating intuitions, previous DG methods can generally be divided into a few categories. Domain Invariant Features: These aim to learn a domain-invariant feature representation, typically by minimising the discrepancy between all source domains – and assuming that the resulting source-domain-invariant feature will work well for the target as well. To this end [27] employed maximum mean discrepancy (MMD), while [12] proposed a multi-domain reconstruction auto-encoder to learn this domain-invariant feature. More recently, [20] applied MMD constraints within the representation learning of an autoencoder via adversarial training. Hierarchical Models: These learn a hierarchical set of model parameters, so that the model for each domain is parameterised by a combination of a domain-agnostic and a domain-specific parameter [16, 17]. After learning such a hierarchical model structure on the source domains, the domain-agnostic parameter can be extracted as the model with the least domain-specific bias, which is most likely to work on a target problem. This intuition has been exploited in both shallow [16] and deep [17] settings. Data Augmentation: A few studies proposed data augmentation strategies to synthesise additional training data to improve the robustness of a model to novel domains. These include the Bayesian network of [34], which perturbs input data based on the domain classification signal from an auxiliary domain classifier. Meanwhile, [38] proposed an adversarial data augmentation method to synthesise ‘hard’ data to enhance the training model’s generalisation ability. Optimisation Algorithms: A final category of approach modifies a conventional learning algorithm in an attempt to find more robust minima during training, for example through meta-learning [18]. Our approach is different to all of these in that it trains a standard deep model, without special data augmentation and with a conventional optimiser. The key idea requires only a simple modification of the training procedure to introduce appropriately constructed episodes. Finally, we are the first to demonstrate the impact of DG model training in large-scale benchmarks such as VD.

Neural Network Meta-Learning  Learning-to-learn and meta-learning methods have resurged recently, in particular in few-shot recognition [9, 35, 25] and learning-to-optimise [29] tasks. Despite significant other differences in motivation and methodological formalisation, a common feature of these methods is the episodic training strategy. In the case of few-shot learning, the intuition is that while a lot of source tasks and data may be available, these should be used for training in a way that closely simulates the testing conditions. Therefore, at each learning iteration, a random subset of source tasks and instances is sampled to generate a training episode defined by a random few-shot learning task of similar data volume and cardinality to what the model is expected to be tested on at runtime. Thus the model eventually ‘sees’ all the training data in aggregate, but in any given iteration it is evaluated in a condition similar to a real ‘testing’ condition. In this paper we aim to develop an episodic training strategy to improve domain robustness, rather than learning-to-learn. While the high-level idea of an episodic strategy is the same, the DG problem and associated episode construction details are completely different.

3 Methodology

Figure 1: Illustration of vanilla domain-aggregation strategy for multi-domain learning. A single classifier is trained to classify data from all domains.

In this section we will first introduce the basic dataset aggregation method (AGG) which provides a strong baseline for DG performance, and then subsequently present three episodic training strategies for training it more robustly.

Problem Setting  In the DG setting, we assume that we are given $n$ source domains $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_n\}$, where $\mathcal{D}_i$ is the $i$-th source domain containing data-label pairs $(x^{(i)}_j, y^{(i)}_j)$ (here $i$ indicates the domain index and $j$ the instance number within a domain; for simplicity, we omit $j$ in the following). The goal is to use these to learn a model that generalises well to a novel testing domain $\mathcal{D}_t$ with different statistics to the training domains, without assuming any knowledge of the testing domain during model learning.

For homogeneous DG, we assume that all the source domains and the target domain share the same label space, $\mathcal{Y}_1 = \dots = \mathcal{Y}_n = \mathcal{Y}_t$. For the more challenging heterogeneous setting, the domains can have different, potentially completely disjoint, label spaces $\mathcal{Y}_i \neq \mathcal{Y}_j$. We will start by introducing the homogeneous case and discuss the heterogeneous case later.

Architecture Framework  For our episode design, we will break a neural network classifier into a sequence of modules. We use two: a feature extractor $\theta$ and a classifier $\psi$, so that a prediction is made as $\hat{y} = \psi(\theta(x))$.

3.1 Overview

Vanilla Aggregation Method  A simple approach to the DG problem is to simply aggregate all the source domains’ data together and train a single CNN end-to-end, ignoring the domain label information entirely. First identified explicitly in [17], this approach is fast, efficient, easy and competitive with more elaborate state-of-the-art alternatives. In terms of neural network modules, it means that both the classifier and the feature extractor are shared across all domains (at least in the homogeneous case), as illustrated in Fig. 1, leading to the objective function:

$$\min_{\theta,\psi} \sum_{i=1}^{n} \mathbb{E}_{(x,y)\sim\mathcal{D}_i}\left[\ell\big(y,\,\psi(\theta(x))\big)\right] \tag{1}$$

where $\ell$ is the cross-entropy loss here.
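As a minimal sketch of this baseline (PyTorch; the module sizes and names below are illustrative toys, not the architectures used in our experiments), one AGG update sums the cross-entropy loss over a batch drawn from every source domain:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the feature extractor (theta) and classifier (psi).
feature_extractor = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
classifier = nn.Linear(32, 5)
optimiser = torch.optim.SGD(
    list(feature_extractor.parameters()) + list(classifier.parameters()), lr=1e-3
)
ce = nn.CrossEntropyLoss()

def agg_step(batches):
    """One AGG update: sum cross-entropy over a batch from each source domain."""
    optimiser.zero_grad()
    loss = sum(ce(classifier(feature_extractor(x)), y) for x, y in batches)
    loss.backward()
    optimiser.step()
    return loss.item()

# Fake batches from three source domains.
batches = [(torch.randn(8, 16), torch.randint(0, 5, (8,))) for _ in range(3)]
```

The domain label is used only to assemble per-domain batches; the model itself is entirely shared.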

Domain Specific Models

Figure 2: Illustration of domain-specific branches. One classifier and feature extractor are trained per-domain.

Our goal is to improve robustness by exposing individual modules to partners that are badly calibrated to a given domain. To obtain these ‘badly calibrated’ components, we also train domain-specific models. As illustrated in Fig. 2, this means that each domain $i$ has its own model composed of feature extractor $\theta_i$ and classifier $\psi_i$. Each domain-specific module is only exposed to data of the corresponding domain. To train domain-specific models, we optimise:

$$\min_{\{\theta_i,\psi_i\}_{i=1}^{n}} \sum_{i=1}^{n} \mathbb{E}_{(x,y)\sim\mathcal{D}_i}\left[\ell\big(y,\,\psi_i(\theta_i(x))\big)\right] \tag{2}$$
Episodic Training  Our goal is to train a domain-agnostic model, as per $\theta$ and $\psi$ in the aggregation method in Eq. 1. We will design an episodic scheme that makes use of the domain-specific modules of Eq. 2 to help the domain-agnostic model achieve the desired robustness. Specifically, we will generate episodes where each domain-agnostic module $\theta$ and $\psi$ is paired with a domain-specific partner that is mismatched with the current data being input: module and data combinations of the form $\psi_j(\theta(x))$ and $\psi(\theta_j(x))$ for $x \sim \mathcal{D}_i$, where $j \neq i$.

3.2 Episodic Training of Feature Extractor

To train a robust feature extractor $\theta$, we ask it to learn features which are robust enough that data from domain $i$ can be processed by a classifier $\psi_j$ that has never experienced domain $i$ before, as shown in Fig. 3. To generate episodes according to this criterion, we optimise:

$$\min_{\theta} \sum_{i=1}^{n} \mathbb{E}_{(x,y)\sim\mathcal{D}_i}\left[\ell\big(y,\,\bar{\psi}_j(\theta(x))\big)\right], \quad j \neq i \tag{3}$$

where the bar in $\bar{\psi}_j$ means that $\psi_j$ is considered constant for the generation of this loss, i.e., it does not receive back-propagated gradients. This gradient-blocking is important, because without it the data from domain $i$ would ‘pollute’ the classifier $\psi_j$, which we want to retain as being naive to domains outside of $j$.

Thus in this optimisation, only the feature extractor $\theta$ is penalised whenever the classifier $\psi_j$ makes the wrong prediction. That means that, for this loss to be minimised, the shared feature extractor must map data into a form that a ‘naive’ classifier can correctly classify. The feature extractor must learn to help a classifier recognise a data point that is from a domain novel to that classifier.
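The gradient-blocking of Eq. 3 can be realised by freezing the mismatched classifier’s parameters: gradients still flow through it back to the feature extractor, but the classifier itself receives no update. A minimal PyTorch sketch with toy module sizes (the names are illustrative):

```python
import torch
import torch.nn as nn

theta = nn.Sequential(nn.Linear(16, 32), nn.ReLU())      # shared feature extractor
psi_j = nn.Linear(32, 5)                                  # domain-j specific classifier
opt_theta = torch.optim.SGD(theta.parameters(), lr=1e-3)  # note: psi_j excluded
ce = nn.CrossEntropyLoss()

def epi_f_step(x_i, y_i):
    """Eq. 3 sketch: penalise only theta when the mismatched classifier errs."""
    # Freeze psi_j: gradients still flow *through* its forward ops back to
    # theta, but psi_j itself receives no gradient (the 'bar' in Eq. 3).
    for p in psi_j.parameters():
        p.requires_grad_(False)
    opt_theta.zero_grad()
    loss = ce(psi_j(theta(x_i)), y_i)
    loss.backward()
    opt_theta.step()
    for p in psi_j.parameters():   # restore for psi_j's own training step
        p.requires_grad_(True)
    return loss.item()
```

Excluding $\psi_j$ from the optimiser alone would also suffice; freezing additionally saves computing its gradients.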

3.3 Episodic Training of Classifier

Figure 3: Episodic training for feature and classifier regularisation. The shared feature extractor feeds domain specific classifiers. The shared classifier reads domain-specific feature extractors.

Analogous to the above, we can also interpret DG as the requirement that a classifier should be robust enough to classify data even if it is encoded by a feature extractor which has never seen this type of data in the past, as illustrated in Fig. 3. Thus to train the robust classifier $\psi$ we ask it to classify domain-$i$ instances fed through a domain-$j$-specific feature extractor $\theta_j$. To generate episodes according to this criterion, we optimise:

$$\min_{\psi} \sum_{i=1}^{n} \mathbb{E}_{(x,y)\sim\mathcal{D}_i}\left[\ell\big(y,\,\psi(\bar{\theta}_j(x))\big)\right], \quad j \neq i \tag{4}$$

where the bar in $\bar{\theta}_j$ means $\theta_j$ is considered constant for the generation of this loss. Similarly to the training of the feature extractor module, this operation is important to retain the domain-specificity of feature extractor $\theta_j$. The result is that only the classifier $\psi$ is penalised, and in order to minimise this loss it must be robust enough to accept data that has been encoded by a feature extractor $\theta_j$ that is naive to domain $i$.
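In this direction the constant feature extractor can simply be evaluated without gradient tracking, since no gradient needs to flow through it to reach the classifier. A toy sketch (module sizes illustrative):

```python
import torch
import torch.nn as nn

theta_j = nn.Sequential(nn.Linear(16, 32), nn.ReLU())  # domain-j feature extractor
psi = nn.Linear(32, 5)                                  # shared classifier
opt_psi = torch.optim.SGD(psi.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def epi_c_step(x_i, y_i):
    """Eq. 4 sketch: only the shared classifier psi is penalised."""
    with torch.no_grad():          # theta_j is constant here (the 'bar' in Eq. 4)
        feats = theta_j(x_i)
    opt_psi.zero_grad()
    loss = ce(psi(feats), y_i)
    loss.backward()
    opt_psi.step()
    return loss.item()
```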

3.4 Episodic Training by Random Classifier

To make robust predictions, a good feature representation is crucial [23, 6]. The episodic feature training strategy above is only suitable for the homogeneous DG setting, since it requires all domains to share the same label space. Here we further introduce a novel extension that is suitable for both homogeneous and heterogeneous label spaces.

In Section 3.2, we introduced the notion of regularising a deep feature extractor by requiring it to support a classifier inexperienced with data from the current domain. Taking this to an extreme, we consider asking the feature extractor to support the predictions of a classifier with random weights, as shown in Fig. 4. To this end, our objective function here is:

$$\min_{\theta} \sum_{i=1}^{n} \mathbb{E}_{(x,y)\sim\mathcal{D}_i}\left[\ell\big(y,\,\bar{\psi}_r(\theta(x))\big)\right] \tag{5}$$

where $\psi_r$ is a randomly initialised classifier, and the bar means it is a constant not updated in the optimisation. This can be seen as an extreme version of our earlier episodic cross-domain feature extractor training (not only has $\psi_r$ not seen any data from domain $i$, it has not seen any data at all). Moreover, it has the benefit of not requiring a label space to be shared across all training domains, unlike the previous method in Eq. 3.

Specifically, in Eq. 3, the routing requires $\psi_j$ to have a label space matching that of $\mathcal{D}_i$. But for Eq. 5, each domain can be equipped with its own random classifier with a cardinality matching its normal label space. This property makes Eq. 5 suitable for heterogeneous domains.

Figure 4: The architecture of the random classifier regularization.
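A sketch of the heterogeneous case: each domain gets its own frozen, randomly initialised classifier sized to that domain’s label space. The domain names and cardinalities below are illustrative, not a real benchmark’s:

```python
import torch
import torch.nn as nn

theta = nn.Sequential(nn.Linear(16, 32), nn.ReLU())   # shared feature extractor
# One frozen random classifier per domain, sized to that domain's own
# label space -- no shared label space is needed.
label_space_sizes = {"domain_a": 5, "domain_b": 13}   # illustrative cardinalities
random_classifiers = {
    name: nn.Linear(32, k).requires_grad_(False)
    for name, k in label_space_sizes.items()
}
opt_theta = torch.optim.SGD(theta.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def epi_r_step(domain, x, y):
    """Eq. 5 sketch: theta must yield features a never-trained classifier can use."""
    opt_theta.zero_grad()
    loss = ce(random_classifiers[domain](theta(x)), y)
    loss.backward()
    opt_theta.step()
    return loss.item()
```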

3.5 Algorithm Flow

Our full algorithm brings together the domain-agnostic modules that it is our goal to improve, and the domain-specific modules that help train them (Section 3.1), and generates episodes according to the three strategies introduced above. Referring to the losses in Eqs. 1, 2, 3, 4, 5 as $\ell_{agg}$, $\ell_{ds}$, $\ell_{epi\_f}$, $\ell_{epi\_c}$, $\ell_{epi\_r}$, overall we optimise:

$$\min \; \ell_{agg} + \ell_{ds} + \lambda_1\,\ell_{epi\_f} + \lambda_2\,\ell_{epi\_c} + \lambda_3\,\ell_{epi\_r} \tag{6}$$

for parameters $\theta, \psi, \{\theta_i, \psi_i\}_{i=1}^{n}$. The full pseudocode for the algorithm is given in Algorithm 1. It is noteworthy that, in practice, when training we first warm up the domain-specific branches for a few iterations before training both the domain-specific and domain-agnostic modules jointly.

1: Input: source domains $\mathcal{D}_1, \dots, \mathcal{D}_n$
2: Initialise hyper-parameters: $\lambda_1, \lambda_2, \lambda_3$
3: Initialise model parameters: domain-specific modules $\{\theta_i\}_{i=1}^{n}$ and $\{\psi_i\}_{i=1}^{n}$; AGG modules $\theta, \psi$; random classifier $\psi_r$
4: while not done training do
5:     for $i = 1, \dots, n$ do
6:         Update $\theta_i, \psi_i$ with $\ell_{ds}$ (Eq. 2)
7:         Update $\theta, \psi$ with $\ell_{agg}$ (Eq. 1)
8:     end for
9:     Update $\theta$ with $\ell_{epi\_f}$ (Eq. 3) and $\ell_{epi\_r}$ (Eq. 5)
10:    Update $\psi$ with $\ell_{epi\_c}$ (Eq. 4)
11: end while
Algorithm 1 Episodic training
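The schedule of Algorithm 1, including the warm-up of the domain-specific branches, can be sketched abstractly as follows (the step labels are descriptive strings and the warm-up length is a placeholder, not our tuned value):

```python
def training_schedule(num_iters, warmup_iters, n_domains):
    """Yield the updates performed at each iteration, mirroring Algorithm 1:
    warm up the domain-specific branches first, then train everything jointly."""
    for it in range(num_iters):
        steps = []
        for i in range(n_domains):
            steps.append(f"Eq. 2: update theta_{i}, psi_{i}")
            if it >= warmup_iters:
                steps.append("Eq. 1: update theta, psi")
        if it >= warmup_iters:
            steps.append("Eq. 3 & Eq. 5: update theta")
            steps.append("Eq. 4: update psi")
        yield it, steps
```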

4 Experiments

4.1 Datasets and Settings

Datasets  We evaluate our algorithm on three different homogeneous DG benchmarks and introduce a novel and larger-scale heterogeneous DG benchmark. The datasets are: IXMAS [39], a cross-view action recognition task. Two object recognition benchmarks: VLCS [8], which includes images from four famous datasets, PASCAL VOC2007 (V) [7], LabelMe (L) [33], Caltech (C) [19] and SUN09 (S) [5]; and PACS [17], which was recently released and shown to have a larger between-domain gap than VLCS. It contains four domains covering Photo (P), Art Painting (A), Cartoon (C) and Sketch (S) images. VD: For the final benchmark we repurpose the Visual Decathlon [30] benchmark to evaluate DG.

Competitors  We evaluate the following competitors: AGG, the vanilla aggregation method introduced in Eq. 1, trains a single model on all source domains. DICA [27] is a kernel-based method for learning domain-invariant feature representations. LRE-SVM [40] is an SVM-based method that trains a different SVM model for each source domain; for a test domain, it uses the SVM model from the most similar source domain. D-MTAE [12] is a de-noising multi-task auto-encoder method, which learns domain-invariant features by cross-domain reconstruction. DSN [3] (Domain Separation Networks) decomposes the source domains into shared and private spaces and learns them with a reconstruction signal. TF-CNN [17] learns a domain-agnostic model by factoring out the common component from a set of domain-specific models, using tensor factorisation to compress the model parameters. CCSA [26] uses semantic alignment to regularise the learned feature subspace. DANN [11] (Domain Adversarial Neural Networks) trains a feature extractor with a domain-adversarial loss among the source domains; the source-domain-invariant feature extractor is assumed to generalise better to novel target domains. MLDG [18] is a recent meta-learning based optimisation method. It mimics the DG setting by splitting source domains into meta-train and meta-test, and modifies the optimisation to improve meta-test performance. Fusion [24] is a method that fuses the predictions of source-domain classifiers for the target domain. MMD-AAE [20] is a recent method that learns a domain-invariant feature autoencoding with adversarial training, ensuring that the domains are aligned using an MMD constraint. Reptile [28] is a recently proposed state-of-the-art shortest-descent-based meta-learner.

We note that DANN (designed for domain adaptation) and Reptile (designed for few-shot learning) were not designed for DG. However, DANN learns domain-invariant features, which is natural for DG, and we found it effective for this problem. Reptile learns to maximise the inner product of gradient updates between different batches within a task. It is related to MLDG, which maximises the inner product of gradient updates between source domains. Therefore we also repurpose it as a baseline. Details can be found in Appendix A.

We call our method Episodic. We use Epi-FCR to denote our full method with (f)eature regularisation, (c)lassifier regularisation and (r)andom-classifier regularisation respectively. Ablated variants such as Epi-F denote feature regularisation alone, etc. Our method is implemented in PyTorch.

Source Target DICA [27] LRE-SVM [40] D-MTAE [12] CCSA [26] MMD-AAE [20] DANN[11] MLDG [18] Reptile [28] AGG Epi-FCR
0,1,2,3 4 61.5 75.8 78.0 75.8 79.1 75.0 70.7 70.5 73.1 76.9
0,1,2,4 3 72.5 86.9 92.3 92.3 94.5 94.1 93.6 93.5 94.2 94.8
0,1,3,4 2 74.7 84.5 91.2 94.5 95.6 97.3 97.5 96.9 95.7 99.0
0,2,3,4 1 67.0 83.4 90.1 91.2 93.4 95.4 95.4 96.2 95.7 98.0
1,2,3,4 0 71.4 92.3 93.4 96.7 96.7 95.7 93.6 94.1 94.4 96.3
Ave. 69.4 84.6 87.0 90.1 91.9 91.5 90.2 90.2 90.6 93.0
Table 1: Cross-view action recognition results (accuracy, %) on the IXMAS dataset.
Source Target DICA [27] LRE-SVM [40] D-MTAE [12] CCSA [26] MMD-AAE[20] DANN [11] MLDG [18] Reptile [28] AGG Epi-FCR
L,C,S V 63.7 60.6 63.9 67.1 67.7 66.4 67.7 66.5 65.4 67.1
V,C,S L 58.2 59.7 60.1 62.1 62.6 64.0 61.3 61.0 60.6 64.3
V,L,S C 79.7 88.1 89.1 92.3 94.4 92.6 94.4 92.9 93.1 94.1
V,L,C S 61.0 54.9 61.3 59.1 64.4 63.6 65.9 64.9 65.8 65.9
Ave. 65.7 65.8 68.6 70.2 72.3 71.7 72.3 71.3 71.2 72.9
Table 2: Cross-dataset object recognition results (accuracy, %) on the VLCS benchmark.

4.2 Evaluation on IXMAS dataset

Setting  The dataset contains 11 different human actions. All actions were video-recorded by 5 cameras with different views (referred to as 0,…,4). The goal is to train an action recognition model on a set of source views (domains), and recognise the action from a novel target view (domain). We follow [20] to keep the first 5 actions and use the same dense trajectory features as input. For our method, we follow [20] in using a one-hidden-layer network with 2000 hidden neurons as our backbone, and report the average result of 20 runs. The optimiser is M-SGD with learning rate 1e-4, momentum 0.9, weight decay 5e-5.

Results  From the results in Table 1, we can see that: (i) The vanilla aggregation method AGG is a strong competitor compared to several prior published methods, as are DANN and Reptile, which we newly identify as effective DG algorithms. (ii) Overall our Epi-FCR model performs best, improving 2.4% on AGG and 1.1% on the prior state of the art MMD-AAE. (iii) In particular, on views 1 and 2 our method achieves new state-of-the-art performance.

4.3 Evaluation on VLCS dataset

Setting  The VLCS domains share 5 categories: bird, car, chair, dog and person. We use pre-extracted DeCAF6 features and follow [26] to randomly split each domain into train (70%) and test (30%) and perform leave-one-out evaluation. We use an architecture of 2 fully connected layers with output sizes of 1024 and 128 with ReLU activation, as per [26], and report the average performance over 20 trials. The optimiser is M-SGD with learning rate 1e-3, momentum 0.9 and weight decay 5e-5.

Results  From the results in Table 2, we can see that: (i) The simple AGG baseline is again competitive with many published alternatives, as are DANN and Reptile. (ii) Our Epi-FCR method achieves the best performance, improving on AGG by 1.7% and surpassing the prior state of the art MMD-AAE and MLDG by 0.6%.

4.4 Evaluation on PACS dataset

Setting  PACS is a recently released dataset with different object style depictions, and a more challenging domain shift than VLCS, as shown in [17]. The dataset shares 7 object categories across domains: ‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘house’, ‘horse’ and ‘person’. We follow the protocol in [17], including the recommended train and validation split, for fair comparison. We first follow [17] in using the ImageNet-pretrained AlexNet (Table 3), and subsequently also use a modern ImageNet-pretrained ResNet-18 (Table 4), as the base CNN architecture. We train our network using the M-SGD optimiser (batch size per domain = 32, lr = 1e-3, momentum = 0.9, weight decay = 5e-5) for 45k iterations when using AlexNet, and use the same optimiser (weight decay = 1e-4) for ResNet-18.

Source Target DICA [27] LRE-SVM [40] D-MTAE [12] DSN [3] TF-CNN [17] MLDG [18] Fusion [24] DANN [11] AGG Epi-FCR
C,P,S A 64.6 59.7 60.3 61.1 62.9 66.2 64.1 63.2 63.4 64.7
A,P,S C 64.5 52.8 58.7 66.5 67.0 66.9 66.8 67.5 66.1 72.3
A,C,S P 91.8 85.5 91.1 83.3 89.5 88.0 90.2 88.1 88.5 86.1
A,C,P S 51.1 37.9 47.9 58.6 57.5 59.0 60.1 57.0 56.6 65.0
Ave. 68.0 59.0 64.5 67.4 69.2 70.0 70.3 69.0 68.7 72.0
Table 3: Cross-domain object recognition results (accuracy, %) of different methods on PACS using pretrained AlexNet.

Results  From the AlexNet results in Table 3, we can see that: (i) Our episodic method obtained the best performance on two held-out domains, C and S, and comparable performance on the A and P domains. (ii) It also achieves the best performance overall, with a 3.3% improvement on vanilla AGG and a 1.7% improvement on the prior state of the art Fusion [24].

Meanwhile, in Table 4, we see that with a modern ResNet-18 architecture the basic results improve across the board, as expected. However: (i) Our newly identified baselines DANN and Reptile manage to improve on the vanilla AGG, further demonstrating their effectiveness for DG. (ii) Our full episodic method maintains the best performance overall, with a 2.4% improvement on AGG.

We note here that when using modern architectures like [36, 13] for DG tasks, we need to be careful with batch normalisation [14]. Batchnorm accumulates statistics of the training data during training, for use at testing. In DG, the source and target domains have domain shift between them, so different ways of employing batch norm produce different results. We tried two ways of coping with batch norm: one is to directly use the frozen pre-trained ImageNet statistics; the other is to unfreeze and accumulate statistics from the source domains. We observed that when training ResNet-18 on PACS, accumulating the statistics from the source domains produced worse accuracy than freezing the ImageNet statistics.
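One way to freeze the pre-trained statistics (a sketch of the general technique, not necessarily our exact implementation) is to put only the batch-norm layers into evaluation mode while the rest of the network stays in training mode; the toy network below stands in for a pre-trained backbone:

```python
import torch
import torch.nn as nn

# Toy network standing in for a pre-trained backbone with BatchNorm layers.
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

def freeze_bn_stats(model):
    """Put only BN layers in eval mode: they then use their stored running
    mean/var (e.g. from ImageNet) and stop accumulating new statistics."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.eval()
    return model

net.train()            # everything else remains in training mode
freeze_bn_stats(net)

x = torch.randn(4, 3, 16, 16)
before = net[1].running_mean.clone()
net(x)                 # forward pass: BN stats are left untouched
print(torch.equal(net[1].running_mean, before))  # True
```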

Source Target AGG DANN [11] MLDG [18] Reptile [28] Epi-FCR
C,P,S A 77.6 81.3 79.5 80.7 82.1
A,P,S C 73.9 73.8 77.3 75.5 77.0
A,C,S P 94.4 94.0 94.3 94.9 93.9
A,C,P S 70.3 74.3 71.5 69.0 73.0
Ave. 79.1 80.8 80.7 80.0 81.5
Table 4: Cross-domain object recognition results (accuracy, %) of different methods on PACS using pretrained ResNet-18.
Figure 5: Cross-domain test accuracy with shared feature extractor or classifier. ‘AC’ means feed domain-A data through the domain-C-specific module. E.g., left: A data through the shared feature extractor into the domain-C classifier; right: A data through the domain-C feature extractor into the shared classifier. Higher is better.

4.5 Further Analysis and Insights

Ablation Study  To understand the contribution of each component of our model, we perform an ablation study using PACS-AlexNet. From the results in Table 5, we can see that episodic training of the feature extractor gives a 1.6% boost over the vanilla AGG. Including episodic training of the classifier further improves performance by 0.5%. Finally, combining all the episodic training components provides a 3.3% improvement over vanilla AGG. This confirms that each component of our model contributes to the final performance.

Cross-Domain Testing Analysis  To understand how our Epi-FCR method obtains its improved robustness to domain shift, we study its impact on cross-domain testing. Recall that when we activate the episodic training of the agnostic feature extractor and classifier, we benefit from the domain-specific branches by routing domain data across domain branches: e.g., we feed domain-$i$ data through the shared feature extractor into a domain-$j$ classifier to train Eq. 3, and through a domain-$j$ feature extractor into the shared classifier to train Eq. 4.

Therefore it is natural to evaluate cross-domain testing after training the models. As illustrated in Fig. 5 (to save space we only display the leave-photo-out split; the others are consistent with these observations), we can see that the episodic training strategy indeed improves cross-domain testing performance. For example, when we feed domain-A data to the domain-C classifier, the episodically trained agnostic extractor improves the performance of the domain-C classifier, which has never experienced domain-A data (Fig. 5, left); and similarly for the episodically trained agnostic classifier.

Analysis of Solution Robustness  In the above experiments we confirmed that our episodic model outperforms the strong AGG baseline on a variety of benchmarks, and that each component of our framework contributes. In terms of analysing the mechanism by which episodic training improves robustness to domain shift, one possible route is that it leads the model to find a higher-quality minimum. Several studies have recently analysed learning algorithm variants in terms of the quality of the minima they lead a model to [15, 4].

One intuition is that converging to a ‘wide’ rather than ‘sharp’ minimum provides a more robust solution, because perturbations (such as domain shift, in our case) are less likely to cause a big hit to accuracy if the model’s performance does not depend on a very precisely calibrated solution. Following [15, 42], we therefore compare the solutions found by AGG and our Epi-FCR by adding noise to the weights of the converged model, and observing how quickly the testing accuracy decreases with the magnitude of the noise. From Fig. 6 we can see that both models’ performance drops as the weights are perturbed, but our Epi-FCR model is more robust to weight perturbations. This suggests that the minimum found by Epi-FCR is more robust than that found by AGG, which may explain the improved cross-domain robustness of Epi-FCR compared to AGG.
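The perturbation probe can be sketched as follows (the model and data here are toy stand-ins; [15, 42] describe the full protocol): perturb a copy of the converged weights with Gaussian noise of growing magnitude and record accuracy at each level.

```python
import copy
import torch
import torch.nn as nn

def perturbed_accuracy(model, x, y, sigma):
    """Accuracy of a copy of `model` whose every weight is perturbed by
    Gaussian noise with standard deviation `sigma`."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
        acc = (noisy(x).argmax(1) == y).float().mean().item()
    return acc

# Toy converged model and evaluation data.
model = nn.Linear(16, 5)
x, y = torch.randn(64, 16), torch.randint(0, 5, (64,))
curve = [perturbed_accuracy(model, x, y, s) for s in (0.0, 0.1, 1.0)]
```

A flatter accuracy-vs-sigma curve indicates a wider, more perturbation-tolerant minimum.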

Source Target AGG Epi-F Epi-FC Epi-FCR
C,P,S A 63.4 63.6 63.7 64.7
A,P,S C 66.1 69.9 69.2 72.3
A,C,S P 88.5 87.7 88.3 86.1
A,C,P S 56.6 60.0 62.1 65.0
Ave. 68.7 70.3 70.8 72.0
Table 5: Ablation study (accuracy, %) on PACS using pretrained AlexNet.
Figure 6: Minima quality analysis: episodic learning (Epi-FCR) vs baseline (AGG).
Benchmark # of data # of Domains # of tasks task space
VLCS 10,729 4 5 homo.
PACS 9,991 4 7 homo.
VD-DG 238,215 9 2128 hetero.
Table 6: VD-DG vs previous DG benchmarks. For VD-DG we exclude ImageNet, since it is regarded as the initialisation.
Target ImageNet PT AGG DANN [11] MLDG [18] Reptile [28] Epi-R
Concate Mean Concate Mean Concate Mean Concate Mean Concate Mean
Aircraft 12.7 17.4 14.6 17.4 15.0 17.4 14.2 17.9 16.0 17.7 13.9
D. Textures 35.2 37.7 35.1 37.9 36.6 38.3 34.6 37.9 35.9 40.2 37.8
VGG-Flowers 48.1 56.3 52.0 55.5 52.2 54.0 53.2 53.7 50.5 55.4 53.0
UCF101 35.0 43.3 35.0 44.5 36.1 44.4 36.7 45.3 36.7 45.7 37.1
Ave. 32.8 38.7 34.2 38.8 35.0 38.5 34.7 38.7 34.8 39.7 35.5
VD-Score 185 265 185 277 202 279 194 284 202 304 217
Table 7: Results of top-1 accuracy (%) and visual decathlon overall scores of different methods on VD-DG. Train on CIFAR-100, Daimler Ped, GTSRB, Omniglot, SVHN and test on Aircraft, D. Textures, VGG-Flowers, UCF101.

4.6 Evaluation on the VD-DG dataset

Visual Decathlon contains ten datasets and was initially proposed as a multi-domain learning benchmark [30]. We re-purpose Decathlon for the more ambitious challenge of domain generalisation. As explained earlier, our motivation is to find out whether DG learning can improve the de facto standard ‘ImageNet-trained CNN feature extractor’ for use as an off-the-shelf representation for new target problems. Given the widespread use of this workflow by vision practitioners, an improvement over a vanilla ImageNet CNN could be of major practical value.

We compare this heterogeneous DG benchmark to the largest existing DG benchmarks in terms of image, domain, and task counts in Table 6. VD-DG has more than twice the domains of VLCS and PACS, and is an order of magnitude larger in terms of data and task numbers. This demonstrates the challenge and greater practical significance of VD-DG.

Setting  For the experiments, we use the five largest datasets (CIFAR-100, Daimler Ped, GTSRB, Omniglot and SVHN) as our source domains, and the four smallest datasets (Aircraft, D. Textures, VGG-Flowers and UCF101) as our target domains. (We always exploit ImageNet as an initial condition, but exclude it from DG training for computational feasibility.) The goal is to use DG training among the five source datasets to learn a feature that outperforms the off-the-shelf ImageNet-trained CNN used as the initial condition. We use ResNet-18 [13] as the backbone model, and resize all images to 64×64 for computational efficiency. To support the heterogeneous VD label spaces, we assume a shared feature extractor and a domain-specific classifier per source domain. We perform episodic DG training among the source domains using our (R)andom classifier model variant, which supports heterogeneous label-spaces. After DG training, the model is used as a fixed feature extractor for the held-out target domain. Its features are combined, by concatenation or mean-pooling, with the original ImageNet pre-trained features. (Since ImageNet is excluded from the source domains, all models lose some performance relative to the original feature due to the forgetting effect.) This final feature is used to train a linear SVM for the corresponding task, as per common practice. We train the network using the M-SGD optimizer (batch size per domain 32, lr 1e-3, momentum 0.9, weight decay 1e-4) for 100k iterations, decaying the lr by a factor of 10 at 40k and 80k iterations.
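The feature-combination step can be sketched as follows; a minimal NumPy illustration, where `combine_features` is a hypothetical helper name and the arrays stand in for real extractor outputs:

```python
import numpy as np

def combine_features(f_dg, f_imagenet, mode="concat"):
    """Combine the DG-trained and ImageNet extractor features before
    training the linear SVM, as in the Concat/Mean columns of Table 7.
    Both inputs are (n_samples, d) feature matrices."""
    if mode == "concat":
        return np.concatenate([f_dg, f_imagenet], axis=1)  # (n, 2d)
    if mode == "mean":
        return 0.5 * (f_dg + f_imagenet)                   # (n, d)
    raise ValueError(f"unknown mode: {mode}")

# e.g. 512-d ResNet-18 penultimate features for a batch of 4 images
f_dg, f_im = np.ones((4, 512)), np.zeros((4, 512))
print(combine_features(f_dg, f_im, "concat").shape)  # (4, 1024)
print(combine_features(f_dg, f_im, "mean").shape)    # (4, 512)
```

The combined matrix is then what a downstream linear SVM would be fit on, one classifier per target task.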

Results  From the results in Table 7, we observe that: (i) we do learn a feature that is more robust to novel domains than the standard ImageNet pre-trained features (Epi-R improves on ImageNet PT by 7.1% or 2.7% on the held-out datasets). (ii) Moreover, while the vanilla AGG baseline also improves on the ImageNet features, our Epi-R provides a clear further improvement over AGG. (iii) In terms of other DG competitors: besides MLDG [18], the only competitors we could feasibly run on this large-scale benchmark were DANN and Reptile – methods that we first identified as re-purposeable for DG in this paper. Other methods either do not support heterogeneous label-spaces, do not scale to this many domains, or do not scale to this many examples. (iv) Overall, our Epi-R method outperforms all alternatives in both average accuracy and the VD score recommended in preference to accuracy in [30]. This is the first demonstration that any DG method can improve robustness to domain shift at larger scale, across heterogeneous domains, and make a practical impact by surpassing ImageNet feature performance.

5 Conclusion

In this paper, we addressed the domain generalisation problem. We proposed a simple episodic training strategy that mimics, during training, the train-test domain shift experienced in a DG scenario. We showed that our method achieves state-of-the-art performance on all the main existing DG benchmarks. More importantly, we provided the first demonstration of DG’s potential value ‘in the wild’, by showing that our model can improve the de facto standard ImageNet pre-trained CNN feature extractor, performing heterogeneous DG at the largest scale to date using the Visual Decathlon benchmark.


  • [1] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, 2006.
  • [2] H. Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. Technical report, 2017.
  • [3] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In NIPS, 2016.
  • [4] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
  • [5] M. J. Choi, J. Lim, and A. Torralba. Exploiting hierarchical context on a large database of object categories. In CVPR, 2010.
  • [6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  • [7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
  • [8] C. Fang, Y. Xu, and D. N. Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In ICCV, 2013.
  • [9] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [10] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
  • [11] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
  • [12] M. Ghifary, W. B. Kleijn, M. Zhang, and D. Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In ICCV, 2015.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [15] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
  • [16] A. Khosla, T. Zhou, T. Malisiewicz, A. Efros, and A. Torralba. Undoing the damage of dataset bias. In ECCV, 2012.
  • [17] D. Li, Y. Yang, Y.-Z. Song, and T. Hospedales. Deeper, broader and artier domain generalization. In ICCV, 2017.
  • [18] D. Li, Y. Yang, Y.-Z. Song, and T. Hospedales. Learning to generalize: Meta-learning for domain generalization. In AAAI, 2018.
  • [19] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop on Generative-Model Based Vision, 2004.
  • [20] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot. Domain generalization with adversarial feature learning. In CVPR, 2018.
  • [21] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
  • [22] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.
  • [23] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  • [24] M. Mancini, S. R. Bulò, B. Caputo, and E. Ricci. Best sources forward: Domain generalization through source-specific nets. In ICIP, 2018.
  • [25] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. Meta-learning with temporal convolutions. arXiv, 2017.
  • [26] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Unified deep supervised domain adaptation and generalization. In ICCV, 2017.
  • [27] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In ICML, 2013.
  • [28] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018.
  • [29] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
  • [30] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In NIPS, 2017.
  • [31] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Efficient parametrization of multi-domain deep neural networks. In CVPR, 2018.
  • [32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, Dec. 2015.
  • [33] B. Russell, A. Torralba, K. Murphy, and W. Freeman. Labelme: A database and web-based tool for image annotation. IJCV, 77, 2008.
  • [34] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi. Generalizing across domains via cross-gradient training. In ICLR, 2018.
  • [35] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few shot learning. In NIPS, 2017.
  • [36] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [37] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv, 2014.
  • [38] R. Volpi, H. Namkoong, O. Sener, J. Duchi, V. Murino, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. arXiv, 2018.
  • [39] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. CVIU, 2006.
  • [40] Z. Xu, W. Li, L. Niu, and D. Xu. Exploiting low-rank structure from latent domains for domain generalization. In ECCV, 2014.
  • [41] Y. Yang and T. M. Hospedales. A unified perspective on multi-domain and multi-task learning. In ICLR, 2015.
  • [42] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In CVPR, 2018.

Appendix A Reptile for DG

In this appendix we explain how the Reptile algorithm [28], originally designed for few-shot learning, can be adapted to the DG problem setting.

1: Input: $n$ source domains $D_1, \dots, D_n$.
2: Initialize the parameters $\theta$, inner step size $\alpha$ and outer step size $\beta$
3: for iteration $= 1, 2, \dots$ do
4:     $\tilde{\theta} \leftarrow \theta$
5:     for $i$ in $[1, n]$ do
6:         Sample a mini-batch from $D_i$ and take one inner SGD step: $\tilde{\theta} \leftarrow \tilde{\theta} - \alpha \nabla_{\tilde{\theta}} \mathcal{L}_i(\tilde{\theta})$
7:     end for
8:     Update $\theta \leftarrow \theta + \beta(\tilde{\theta} - \theta)$
9: end for
Algorithm 2 Homogeneous Reptile-DG

A.1 Homogeneous Reptile-DG

We first consider homogeneous DG. The re-purposed Reptile-DG is shown in Algorithm 2. Given the source domains $D_1, \dots, D_n$, in each iteration we first copy the training parameters $\theta$ as $\tilde{\theta}$. Then, for each source domain in turn, we randomly sample a mini-batch and perform a one-step inner SGD update on $\tilde{\theta}$. Once we have sampled mini-batches from all source domains, we perform a one-step outer update on $\theta$, moving it towards $\tilde{\theta}$.
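As a concrete illustration, one outer iteration can be sketched as follows; a minimal NumPy toy, assuming each domain's loss is exposed as a gradient callable (the quadratic losses and function names are illustrative, not from the paper):

```python
import numpy as np

def reptile_dg_step(theta, domain_grads, alpha, beta):
    """One outer iteration of homogeneous Reptile-DG: sequential inner
    SGD through all source domains, then a shortest-path outer update
    theta <- theta + beta * (theta_tilde - theta).
    domain_grads: list of callables, one gradient function per domain."""
    theta_tilde = theta.copy()
    for grad_i in domain_grads:        # one inner step per source domain
        theta_tilde = theta_tilde - alpha * grad_i(theta_tilde)
    return theta + beta * (theta_tilde - theta)

# Toy example: three "domains", each a quadratic loss centred at c_i,
# so grad_i(theta) = theta - c_i.
centres = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
grads = [lambda t, c=c: t - c for c in centres]

theta = np.array([0.0])
for _ in range(200):
    theta = reptile_dg_step(theta, grads, alpha=0.1, beta=0.5)
# theta converges near the consensus of the three domain optima (close
# to 2; the fixed visiting order introduces a small bias).
```

Randomizing the domain order each iteration, as in the algorithm, removes the small ordering bias in expectation.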

How does Reptile-DG work?  Denote the loss and parameters of inner step $i$ by $L_i$ and $\theta_i$, so the inner gradient of that step is $g_i = L_i'(\theta_i)$ and the inner update is $\theta_{i+1} = \theta_i - \alpha g_i$. Taking the Taylor series of $g_i$ at the initial point $\theta_1$, we get

$$g_i = L_i'(\theta_1) + L_i''(\theta_1)(\theta_i - \theta_1) + O(\|\theta_i - \theta_1\|^2), \qquad (7)$$

where the $O(\|\theta_i - \theta_1\|^2) = O(\alpha^2)$ terms are omitted due to their small values. If we write the gradient and Hessian of $L_i$ w.r.t. $\theta_1$ as $\bar{g}_i = L_i'(\theta_1)$ and $\bar{H}_i = L_i''(\theta_1)$, we have

$$g_i = \bar{g}_i + \bar{H}_i(\theta_i - \theta_1) + O(\alpha^2). \qquad (8)$$

From the inner updates we get $\theta_i - \theta_1 = -\alpha \sum_{j=1}^{i-1} g_j$. Together with $g_j = \bar{g}_j + O(\alpha)$, Eq. 7 becomes

$$g_i = \bar{g}_i - \alpha \bar{H}_i \sum_{j=1}^{i-1} \bar{g}_j + O(\alpha^2). \qquad (9)$$

Now consider an example with two source domains. The gradient update of Reptile-DG is

$$g_{\text{Reptile}} = (\theta_1 - \theta_3)/\alpha = g_1 + g_2. \qquad (10)$$

And when we bring Eq. 9 in, we get

$$g_{\text{Reptile}} = \bar{g}_1 + \bar{g}_2 - \alpha \bar{H}_2 \bar{g}_1 + O(\alpha^2). \qquad (11)$$

As mentioned in Reptile [28], if we take the expectation over the two inner-step losses $L_1, L_2$, whose sampling order is random, we get

$$\text{AvgGradInner} = \mathbb{E}[\bar{H}_2 \bar{g}_1] = \mathbb{E}[\bar{H}_1 \bar{g}_2] = \tfrac{1}{2}\,\mathbb{E}\Big[\tfrac{\partial}{\partial \theta_1}(\bar{g}_1 \cdot \bar{g}_2)\Big]. \qquad (12)$$

Since $g_{\text{Reptile}}$ contains the term $-\alpha\,\text{AvgGradInner}$ in expectation, descending along $g_{\text{Reptile}}$ moves $\theta$ in the direction that increases the inner product between gradients of mini-batches from different source domains, which is similar to what MLDG [18] does.
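The two-step expansion $\bar{g}_1 + \bar{g}_2 - \alpha \bar{H}_2 \bar{g}_1$ can be checked numerically; a sketch with quadratic toy losses, for which the gradients are linear and the expansion holds exactly rather than to $O(\alpha^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha = 3, 0.01

# Two quadratic "domain" losses L_i(t) = 0.5 t^T A_i t - b_i^T t,
# with gradient A_i t - b_i and constant Hessian A_i.
A = [M @ M.T + np.eye(d) for M in (rng.normal(size=(d, d)),
                                   rng.normal(size=(d, d)))]
b = [rng.normal(size=d) for _ in range(2)]
grad = [lambda t, i=i: A[i] @ t - b[i] for i in range(2)]

theta1 = rng.normal(size=d)

# Exact two-step Reptile gradient: g1 evaluated at theta1, g2 at theta2.
g1 = grad[0](theta1)
theta2 = theta1 - alpha * g1
g2 = grad[1](theta2)
reptile_grad = g1 + g2

# The Taylor-based expansion, evaluated entirely at the start point.
gbar1, gbar2, Hbar2 = grad[0](theta1), grad[1](theta1), A[1]
expansion = gbar1 + gbar2 - alpha * Hbar2 @ gbar1

print(np.max(np.abs(reptile_grad - expansion)))  # ~0 up to fp rounding
```

For general (non-quadratic) losses the two quantities agree only up to the omitted higher-order terms.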

1: Input: $n$ source domains $D_1, \dots, D_n$.
2: Initialize the feature extractor $\psi$ and classifiers $\theta_1, \dots, \theta_n$, inner step size $\alpha$ and outer step size $\beta$
3: for iteration $= 1, 2, \dots$ do
4:     $\tilde{\psi} \leftarrow \psi$
5:     for $i$ in $[1, n]$ do
6:         Sample a mini-batch from $D_i$ and take one inner SGD step on the shared extractor and domain-$i$ classifier: $\tilde{\psi} \leftarrow \tilde{\psi} - \alpha \nabla_{\tilde{\psi}} \mathcal{L}_i(\tilde{\psi}, \theta_i)$, $\theta_i \leftarrow \theta_i - \alpha \nabla_{\theta_i} \mathcal{L}_i(\tilde{\psi}, \theta_i)$
7:     end for
8:     Update $\psi \leftarrow \psi + \beta(\tilde{\psi} - \psi)$
9: end for
Algorithm 3 Heterogeneous Reptile-DG

Benefit of Reptile-DG  Unlike MLDG [18], in each iteration Reptile-DG simply samples mini-batches from all source domains in a random order, without explicitly splitting the source domains into meta-train and meta-test sets. In addition, optimization by shortest-path descent does not require the second-order gradients used by the meta-optimization in [18].

A.2 Heterogeneous Reptile-DG

As analyzed in the previous section, Reptile-DG learns to maximize the inner product between gradients of different mini-batches, under the assumption that all source domains share the entire model. In heterogeneous DG, the source domains have different task spaces and only the feature extractor module is shared. Thus, to apply Reptile-DG to the heterogeneous setting, we apply the optimization to the feature extractor alone, as shown in Algorithm 3. By maximizing the inner product between the feature-extractor gradients of different mini-batches, it improves the generalization of the feature representation. If the feature extractor and the classifiers are not flexibly decomposable, a practical way of implementing Algorithm 3 is to directly apply the shortest-path descent optimization to the entire model, similar to Algorithm 2, i.e. to the feature extractor $\psi$ together with all the classifiers $\theta_1, \dots, \theta_n$. However, because the source domains only share the feature extractor, the GradInner term still only applies to the feature extractor parameters $\psi$.
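This variant can be sketched as follows; a minimal NumPy toy, where the gradient-callable interface, function names, and quadratic losses are illustrative assumptions:

```python
import numpy as np

def hetero_reptile_dg_step(psi, thetas, domain_grads, alpha, beta):
    """One outer iteration of heterogeneous Reptile-DG: the inner SGD
    updates both the shared feature extractor psi and each visited
    domain's own classifier theta_i, but only psi receives the
    shortest-path outer update.
    domain_grads[i](psi, theta_i) -> (d_psi, d_theta) for domain i."""
    psi_tilde = psi.copy()
    for i, grad_i in enumerate(domain_grads):
        d_psi, d_theta = grad_i(psi_tilde, thetas[i])
        psi_tilde = psi_tilde - alpha * d_psi    # shared, via the copy
        thetas[i] = thetas[i] - alpha * d_theta  # plain SGD, classifier i
    return psi + beta * (psi_tilde - psi), thetas

# Toy check: two domains with losses 0.5*(psi + theta_i - c_i)^2,
# "heterogeneous" in that each theta_i belongs to a single domain.
def make_grad(c):
    def g(psi, theta):
        r = psi[0] + theta[0] - c
        return np.array([r]), np.array([r])
    return g

grads = [make_grad(1.0), make_grad(-1.0)]
psi, thetas = np.array([0.5]), [np.array([0.0]), np.array([0.0])]
for _ in range(300):
    psi, thetas = hetero_reptile_dg_step(psi, thetas, grads, 0.1, 0.5)
# Both per-domain losses are driven towards zero around the shared psi.
```

Note that only the shared parameters take the Reptile-style interpolation step; the classifiers follow ordinary SGD, matching the analysis that GradInner applies to the feature extractor alone.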