
Learning to Generalize One Sample at a Time with Self-Supervision

Although deep networks have significantly increased the performance of visual recognition methods, it is still challenging to achieve the robustness across visual domains that is necessary for real-world applications. To tackle this issue, research on domain adaptation and generalization has flourished over the last decade. An important aspect to consider when assessing the work done in the literature so far is the amount of data annotation necessary for training each approach, both at the source and target level. In this paper we argue that the data annotation overload should be minimal, as it is costly. Hence, we propose to use self-supervised learning to achieve domain generalization and adaptation. We consider learning regularities from non-annotated data as an auxiliary task, and cast the problem within a principled Auxiliary Learning framework. Moreover, we suggest to further exploit the ability to learn about visual domains from non-annotated images by learning from target data at test time, as data are presented to the algorithm one sample at a time. Results on three different scenarios confirm the value of our approach.


1 Introduction

As visual recognition algorithms get ready to be deployed in several markets, the need for tools to ensure robustness across various visual domains becomes more pressing. Even when massive amounts of data are available, the underlying distributions of the training (source) and test (target) data are inevitably going to differ. Research in the area of adaptive learning has addressed this general issue in various sub-cases, from early works on semi-supervised Domain Adaptation (DA) [Saenko:2010, KulisSD11] up to very recent attempts to deal with Domain Generalization (DG) [Li_2018_CVPR, hospedales19] (for a comprehensive review see Section 2).

Figure 1: In our approach we propose to exploit self-supervision as an auxiliary task together with the primary supervised task. Both are learned over multiple source data regardless of their exact domain label (no need to separate S1 from S2 and S3). Incoming unlabeled target samples actively adapt a self-supervised feature extractor module with finetuning. The refined representation is then aggregated with source-based knowledge for the final label prediction.

An important aspect that remains to be evaluated is the real data annotation effort still needed by existing DA and DG methods. Given that the minimum amount of labeling for a multi-class categorization approach corresponds to the class identity of the training images (Source Only in Figure 2, left), we see that most DG algorithms require training data to be annotated also with their source domain labels [Li_2018_CVPR, hospedales19]. Approaches proposing to leverage unlabeled auxiliary domains require metadata describing their structure and their relation to the labeled source [adagraph]. DA algorithms need advance access to large quantities of images from the target domain, all depicting the very same classes imaged in the source data, with few notable exceptions [PADA_eccv18, Saito_2018_ECCV, cocktail_CVPR18]; and so forth, up to transfer learning techniques where the source model is useful only in relation to a good amount of labeled target data, big enough to allow CNN training convergence (Figure 2, right).

All these types of data knowledge reveal that the entry point of current state-of-the-art algorithms asks for an annotation effort that might still be too costly from the point of view of an end user. Starting from this scenario, the goal of our work is to push visual recognition yet one step closer to deployment in the wild. We aim for a principled method able to generalize to new domains using only the source class annotation (no source domain labels) and that, given a single unlabeled target sample at test time, can leverage its inherent knowledge for further training before the final prediction (see Figure 1). To do so, we exploit self-supervised data knowledge to regularize the learning process of a source classification model. Similarly to [jigen], we take into consideration the spatial co-location of patches for an image decomposed and reorganized as in a jigsaw puzzle. However, rather than using a flat multi-task architecture, we design a residual block that focuses on self-supervised information and provides the main fully supervised learning flow with useful complementary knowledge (see Figure 3). This strategy has two main advantages. On one side, it improves the stability of the results, removing the need for further control conditions on the classification model, such as the entropy loss introduced in the DA setting by [jigen], which requires a dedicated tuning process for its relative weight. On the other side, by concentrating the use of self-supervision in a specific part of the network rather than distributing it, we can easily fine-tune only the auxiliary module at test time on each single test sample.

Summarizing, the contributions of this paper are:

  • we introduce a new end-to-end deep learning algorithm able to Generalize One Sample at a time (GeOS), using the same amount of data annotation as the naïve Source Only baseline;

  • we show how self-supervised knowledge can be used in a principled auxiliary learning framework for DG and DA, improving in robustness and performance over the existing flat multi-task approach [jigen];

  • we present a new generalization setting where it is possible to further train the learning model on every single test sample by exploiting self-supervision;

  • we show how to get top results in the predictive DA setting [adagraph] without the need for human-annotated auxiliary knowledge.

    2 Related Work

Figure 2: Overview of the most recent DA and DG literature, sorted by the amount of data annotation needed.

    Self-Supervised Learning

SSL is a framework developed to learn visual features from large-scale unlabeled data [SSLsurvey]. Its first step is the choice of a pretext task that exploits inherent data attributes to automatically generate data labels. It has been shown that the semantic knowledge captured by the first layers of a network solving such tasks provides a useful initialization for new learning problems. Indeed, the second SSL step consists in transferring the self-supervised model of those initial layers to a real downstream task (classification, detection), while the final part of the network is trained anew. The advantage provided by the transferred model generally becomes more evident as the number of annotated samples for the downstream task decreases.

The pretext tasks can be organized in three main groups. One group relies only on the original visual cues and involves either the whole image, with geometric transformations (translation, scaling, rotation [gidaris2018unsupervised, NIPS2014_geometric]), clustering [caron2018deep], inpainting [pathakCVPR16context] and colorization [zhang2016colorful], or image patches, focusing on their equivariance (learning to count [learningtocount]) and relative position (solving jigsaw puzzles [NorooziF16, Noroozi_2018_CVPR]). A second group uses external sensory information, either real or synthetic: this solution is often applied to multi-cue (visual-to-audio [audiovisual], RGB-to-depth [ren-cvpr2018]) and robotic data [grasp2vec, visiontouch]. Finally, the third group relies on video and on the regularities introduced by the temporal dimension [Wang_UnsupICCV2015, SSLvideo]. The most recent SSL research trends are mainly two. On one side there is the proposal of novel pretext tasks, compared on the basis of their ability to initialize a downstream task with respect to using supervised models as in standard transfer learning [OquabTL, YosinskiNIPS2014, FTmedical, SpotTune, Long:2015, liICML18]. On the other side there are new approaches that combine multiple pretext tasks in multi-task settings [multitaskSSL, ren-cvpr2018].

    Domain Adaptation and Generalization

To cope with domain shift, several algorithms have been developed, mainly in two different settings. In DA the learning process has access to the labeled source data and to the unlabeled target data, and the aim is to generalize to that specific target set [csurka_book]. The semi-supervised DA case also considers the availability of a limited number of annotated target samples [Saenko:2010, KulisSD11, doretto2017, semisup1, semisup2, ijcai2018, museumECCV18]. In DG the target is unknown at training time: the learning process can usually leverage multiple sources to define a model robust to any new, previously unseen domain [shallowDG]. In both DA and DG, the main assumption is that source and target share the same label set, with very few works studying exceptions to this basic condition [PADA_eccv18, cocktail_CVPR18, Saito_2018_ECCV].

Feature-level strategies focus on learning domain-invariant data representations, mainly by minimizing different domain shift measures [Long:2015, LongZ0J17, dcoral, hdivergence]. The domain shift can also be reduced by training a domain classifier and inverting the optimization to guide the features towards maximal domain confusion [Ganin:DANN:JMLR16, Hoffman:Adda:CVPR17]. This adversarial approach has several variants, some of which also exploit class-specific domain recognition modules [saito2017maximum, Li_2018_ECCV]. Metric learning [doretto2017] and deep autoencoders [DGautoencoders, Li_2018_CVPR, Bousmalis:DSN:NIPS16] have also been used to search for domain-shared embedding spaces. In DG, these approaches leverage the availability of multiple sources and access to the domain label of each sample, meaning that the identity of the source distribution from which every sample is drawn is strictly needed.

Model-level strategies either change how the data are loaded, with ad-hoc episodes [hospedales19], or modify conventional learning algorithms to search for more robust minima of the objective function [MLDG_AAA18], or introduce domain alignment layers in standard learning networks [carlucci2017auto]. These layers can also be used in multi-source DA to evaluate the relation between the sources and the target and then perform source model weighting [MassiRAL, cocktail_CVPR18]. Several DG approaches aim at identifying and neglecting domain-specific signatures from multiple sources, through both shallow and deep methods that exploit multi-task learning [ECCV12_Khosla], low-rank network parameter decomposition [hospedalesPACS, Ding2017DeepDG] or aggregation layers [Antonio_GCPR18, hospedales19]. In multi-source DA the domain label of the sources may be unknown [mancini2018boosting, hoffman_eccv12, carlucci2017auto], while for DG it remains crucial information that must be provided from the start.

Finally, many recent methods adopt data-level solutions based on variants of Generative Adversarial Networks (GANs, [Goodfellow:GAN:NIPS2014]) to synthesize new images. Indeed, it is possible to reduce the domain gap by producing source-like target images and/or target-like source images [russo17sbadagan, cycada], as well as a sequence of intermediate samples shifting from the source to the target [DLOW]. The data augmentation strategies in [DG_ICLR18, Volpi_2018_NIPS] learn how to properly perturb the source samples, even in the challenging case of DG from a single source. The combination of data- and feature-level strategies has also shown further improvements in results [ADAGE, sankaranarayanan2017generate].

Some recent works have started investigating intermediate settings between DA and DG. In Predictive DA a labeled source and several auxiliary unlabeled domains are available at training time, together with metadata that describe their relation [adagraph, multivariatereg]. Other works propose approaches to push model-based DA solutions towards the DG setting, adding a memory able to accumulate over multiple target samples at test time [adagraph, MassiRAL]. Although this is an interesting direction for online and continuous learning, it might only be seen as an upper-limit condition of real DG in the wild, where we need a separate prediction for every sample. Moreover, [jigen] has recently started a new research direction moving SSL from the transfer learning to the domain generalization setting, showing that self-supervision provides useful auxiliary information to close the domain gap. In particular, it showed that solving jigsaw puzzles improves the generalization properties of a supervised classification model when the two tasks are jointly learned with a flat multi-task approach.

    Multi-Task and Auxiliary Learning

MTL aims at simultaneously training over several tasks that mutually help each other [CaruanaMTL]. In deep learning this means searching for a single feature representation that works well for multiple tasks. This framework is at the basis of many CNN segmentation and detection algorithms [SimultaneousECCV14, FRCNN]. Several architectures have been investigated to better exploit inter-task connections and task-knowledge complementarity while growing the number of combined tasks [Kokkinos_2017_CVPR, stitch, AZ_SS]. Although powerful, MTL has one main drawback: it is sensitive to the weight assigned to each task, i.e., the choice of the scaling coefficients used to combine the multiple losses. This creates the need for extensive hyperparameter tuning [Kokkinos_2017_CVPR] or for principled loss weighting strategies. Some recent approaches leverage the evaluation of task uncertainty [fullyadaptive, kendall2017multi] and dynamically adjust the weights [gradnorm, Guo_2018_ECCV].

In many real applications the tasks are not all equally important, and some prior knowledge on their ranking is available. In particular, the case with one primary and several other auxiliary tasks is known as Auxiliary Learning (AL) and is related to the literature on learning with privileged information [LUPI]. Very recently, [NIPS2018ROCK] presented a residual strategy to integrate multi-modal auxiliary tasks and improve the performance of the primary object detection task. In [auxtasks] the main focus is on the choice of auxiliary tasks, which should be as cheap as possible in terms of annotation and learning effort. This research direction is attracting more and more attention, with the introduction of unsupervised [aux2] and self-supervised [aux3] auxiliary tasks.

    3 Generalize from One Sample

    Figure 3: Schematic illustration of GeS architecture for learning with self-supervised auxiliary information. The primary network is trained on the supervised task. The auxiliary component refines features for the main classifier, while being trained to solve the Jigsaw Puzzle problem. Lines indicate feature paths in the network. A dotted line means gradients won’t be computed for the underlying layers.

The standard DG problem setting considers $S$ source domains, each providing image-label pairs $\{(x^s_j, y^s_j)\}_{j=1}^{N_s}$. The goal is learning a model that generalizes to any test sample drawn from a new target. The source domain index $s$ is needed by most of the existing DG algorithms, which use it to separate source-specific from source-generic knowledge. We choose instead to ignore it and deal directly with the samples $\{(x_j, y_j)\}_{j=1}^{N}$, where $N = \sum_s N_s$, focusing only on the class annotation $y$. Moreover, by operating simple geometric transformations on each $x$, we can get a variety of new versions $z_k$, with $k = 1, \ldots, K$. Examples of transformations may be $90°$ rotations, which lead to $K=4$ possible versions of each sample [gidaris2018unsupervised], or patch-based decomposition and shuffling as in a jigsaw puzzle, which for an $n \times n$ grid leads to up to $K = (n^2)!$ variants of each sample [jigen]. The obtained self-supervised data-label pairs $(z_k, p_k)$, where the label $p_k \in \{1, \ldots, K\}$ identifies the applied transformation, allow us to define an auxiliary classification task that can be trained jointly with the primary one, improving its generalization effect across multiple sources.
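To make the notation concrete, the following minimal sketch (with hypothetical helper names, not the authors' code) shows how the jigsaw pairs $(z_k, p_k)$ could be produced from a single image tensor; each label simply indexes the applied permutation.

```python
# Minimal sketch (not the authors' code): build jigsaw-shuffled variants z_k
# and permutation labels p_k from a single image tensor.
import torch

def make_jigsaw_variants(img, permutations, grid=3):
    """img: (C, H, W) tensor with H and W divisible by `grid`.
    permutations: list of K tuples, each a permutation of range(grid*grid).
    Returns (variants, labels): (K, C, H, W) scrambled images and (K,) ids."""
    c, h, w = img.shape
    th, tw = h // grid, w // grid
    # Cut the image into grid*grid equally sized tiles, row-major order.
    tiles = [img[:, r * th:(r + 1) * th, q * tw:(q + 1) * tw]
             for r in range(grid) for q in range(grid)]
    variants, labels = [], []
    for k, perm in enumerate(permutations):
        out = torch.empty_like(img)
        for pos, src in enumerate(perm):
            r, q = divmod(pos, grid)
            out[:, r * th:(r + 1) * th, q * tw:(q + 1) * tw] = tiles[src]
        variants.append(out)
        labels.append(k)
    return torch.stack(variants), torch.tensor(labels)
```

For a 222×222 image on a 3×3 grid, each permutation in the list yields one scrambled variant, so the label space of the auxiliary classifier coincides with the chosen permutation set.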

    Training Process

The general architecture of our model is shown in Figure 3. It is composed of a main convolutional backbone that extracts the features $f(x)$ from the original images $x$ and provides them as input to the fully connected module of the primary task, which is in charge of computing the classification prediction. To this fairly general network we add a new residual auxiliary block that deals with the self-supervised data-label pairs. We focus on the jigsaw puzzle task, following the same approach used in [jigen]. In particular, the original images are decomposed through a regular $n \times n$ grid into $n^2$ tiles ($3 \times 3$ in our experiments), which are then randomly re-assigned to the grid positions (Figure 3, bottom left). Out of all the possible permutations, we consider a set of $P$ cases selected with the Hamming-distance-based algorithm of [NorooziF16]. The auxiliary block takes as input the features extracted by the convolutional part of the main network from the scrambled images $z$, and further processes them through a few extra convolutional layers before the final fully connected auxiliary classification module that recognizes the puzzle permutation. We indicate with $a(x)$ the features encoded by the auxiliary block from the original images, which contribute back to the primary task representation: the input to the fully connected module of the primary network is the element-wise sum $f(x) + a(x)$.
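The permutation subset can be chosen so that its elements are mutually well distinguishable. Below is a compact sketch of a greedy maximal-Hamming-distance selection in the spirit of [NorooziF16]; the candidate sampling here is a simplification of the original procedure, and $P = 30$ is only a placeholder value.

```python
# Greedy selection of P mutually distant permutations (a sketch in the
# spirit of the Hamming-distance based algorithm of [NorooziF16]).
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def select_permutations(n_tiles=9, n_perms=30, pool_size=5000, seed=0):
    rng = random.Random(seed)
    # Sample a candidate pool instead of enumerating all 9! permutations.
    pool = [tuple(rng.sample(range(n_tiles), n_tiles)) for _ in range(pool_size)]
    chosen = [pool.pop()]
    while len(chosen) < n_perms:
        # Pick the candidate whose minimum distance to the chosen set is largest.
        best = max(pool, key=lambda p: min(hamming(p, c) for c in chosen))
        pool.remove(best)
        chosen.append(best)
    return chosen
```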

We underline that, although the primary and the auxiliary tasks share the initial feature extraction process and present the described final feature recombination point, they are actually optimized independently. Let $\mathcal{L}_p$ indicate the cross-entropy loss of the primary task and $\mathcal{L}_a$ the cross-entropy loss of the auxiliary jigsaw task, and let $\theta_f, \theta_p$ be the parameters of the main network (convolutional backbone and primary classifier) and $\theta_a$ those of the auxiliary block. We overall train the network by optimizing the two following objectives:

$$\min_{\theta_f,\,\theta_p} \; \mathcal{L}_p\big(y \,\big|\, f(x;\theta_f) + \bar{a}(x)\,;\; \theta_p\big) \qquad (1)$$

$$\min_{\theta_a} \; \mathcal{L}_a\big(p \,\big|\, \bar{f}(z)\,;\; \theta_a\big) \qquad (2)$$

where the bar denotes a stop-gradient, i.e., the term is treated as a constant during the update. To summarize it in words: the gradients of the auxiliary loss do not backpropagate into the primary network, and the gradients of the primary loss affect the auxiliary module only indirectly, through the update of the initial convolutional part of the main network.
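A sketch of how objectives (1) and (2) can translate into a single training step is given below; the module and optimizer names are illustrative assumptions, and the two detach() calls realize the stop-gradients, i.e., the gradient zeroing at the input and output ends of the auxiliary block.

```python
# Sketch of one training step implementing objectives (1)-(2); names are
# illustrative, not the authors' code. opt_primary covers the backbone and
# primary head, opt_aux covers the auxiliary block and its jigsaw head.
import torch.nn.functional as F

def training_step(backbone, primary_head, aux_block, aux_head,
                  opt_primary, opt_aux, x, y, z, p):
    feats = backbone(x)                       # f(x): features of ordered images
    aux_feats = aux_block(feats)              # a(x): auxiliary refinement
    # Objective (2): jigsaw classification on scrambled images z with labels p.
    # detach() zeroes gradients at the block input, so the auxiliary loss
    # never backpropagates into the backbone.
    loss_aux = F.cross_entropy(aux_head(aux_block(backbone(z).detach())), p)
    # Objective (1): class prediction on the element-wise sum f(x) + a(x).
    # Detaching a(x) at the recombination point keeps the primary loss from
    # updating the auxiliary block directly.
    loss_primary = F.cross_entropy(primary_head(feats + aux_feats.detach()), y)
    opt_primary.zero_grad(); opt_aux.zero_grad()
    (loss_primary + loss_aux).backward()      # accumulated, synchronous update
    opt_primary.step(); opt_aux.step()
    return loss_primary.item(), loss_aux.item()
```

Note that the two optimizers act on disjoint parameter sets, so a single backward pass over the summed losses yields the same updates as two separate ones.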

    Testing Process and One Sample Learning

Given a test sample from an unknown target domain, we extract both the primary and the auxiliary features, feed the classification model, get the prediction and check whether the assigned class is correct or not. With respect to this naïve testing process, the self-supervised nature of the auxiliary task gives us the possibility to further learn from the single available test sample. Indeed, we can always decompose the sample into patches to create its shuffled variants and further minimize the auxiliary puzzle classification loss. In this way the auxiliary block is fine-tuned on the single observed example, and we can expect a benefit from recombining the auxiliary features with those of the primary model. The exact procedure of auxiliary learning from one sample at test time is described in Algorithm 1.

Data: source-trained model $(\theta_f, \theta_p, \theta_a)$, test sample $x^t$
while still iterations do
       $(z, p)$ $\leftarrow$ generate a random self-supervised mini-batch from the variants of $x^t$
       minimize the auxiliary loss $\mathcal{L}_a\big(p \,\big|\, \bar{f}(z)\,;\; \theta_a\big)$
       update $\theta_a$
end while
predict the label of $x^t$ using $\theta_f$, $\theta_p$ and the fine-tuned $\theta_a$
Algorithm 1: One Test Sample Learning
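In code, Algorithm 1 could look as follows (a sketch with illustrative names, reusing make_jigsaw_variants from the earlier snippet); deep copies keep the source-trained model intact across test samples, so each prediction remains independent.

```python
# Sketch of Algorithm 1 (illustrative names): fine-tune a copy of the
# auxiliary block on jigsaw variants of one unlabeled test sample, then predict.
import copy
import torch
import torch.nn.functional as F

def predict_one_sample(backbone, primary_head, aux_block, aux_head,
                       x_t, permutations, n_iters=5, lr=0.001, batch=32):
    aux_block = copy.deepcopy(aux_block)      # leave the source model intact
    aux_head = copy.deepcopy(aux_head)
    opt = torch.optim.SGD(list(aux_block.parameters()) +
                          list(aux_head.parameters()), lr=lr, momentum=0.9)
    z_all, p_all = make_jigsaw_variants(x_t, permutations)
    for _ in range(n_iters):
        idx = torch.randperm(len(p_all))[:batch]   # random mini-batch of variants
        feats = backbone(z_all[idx]).detach()      # no gradient to the backbone
        loss = F.cross_entropy(aux_head(aux_block(feats)), p_all[idx])
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                          # predict on the ordered sample
        f = backbone(x_t.unsqueeze(0))
        return primary_head(f + aux_block(f)).argmax(dim=1)
```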

    Implementation Details

We instantiate the main network backbone as a ResNet18 architecture and use a standard residual block as our auxiliary self-supervised module. Specifically, the auxiliary block adds a fully connected layer after its last convolution for the self-supervised predictions. The main network, parametrized by $\theta_f, \theta_p$, is initialized with a pre-trained ImageNet model, while for the auxiliary block, parametrized by $\theta_a$, we use random uniform weights. The output of the main network and that of the auxiliary block are aggregated with a plain element-wise sum. For each training iteration we feed the network with mini-batches of original and transformed images, using batch accumulation to synchronously update $(\theta_f, \theta_p)$ and $\theta_a$. Our architecture has a structure similar to the one recently presented in [NIPS2018ROCK], but we implement a tailored backpropagation policy that keeps the primary and the auxiliary learning processes separate by zeroing the gradients at both the input and output ends of the auxiliary block.

    4 Experiments

    Datasets

The proposed GeOS algorithm is mainly designed to work in the DG setting with data from multiple sources, using only the sample category labels and ignoring the domain annotation. In other words, GeOS works with the same amount of data knowledge as the naïve Source Only reference, also known as Deep All, since a basic CNN can be trained on all the aggregated source samples.

To test GeOS in the DG scenario we focus on the PACS dataset [hospedalesPACS], which contains approximately 10,000 images of 7 common categories across 4 different domains (Photo, Art painting, Cartoon, Sketch) characterized by large visual shifts. We further investigate the behaviour of GeOS in the multi-source DA setting with the same dataset, always considering one domain as target and the other three as sources.

Finally, we evaluate GeOS in Predictive Domain Adaptation (PDA), a particular DG setting recently put under the spotlight by [adagraph]. Here a single labeled source is available at training time together with a set of unlabeled auxiliary domains that come with extra metadata (image timestamp, camera pose, etc.) useful to derive the reciprocal relation among the auxiliary sets and the labeled source. For PDA we follow [adagraph], testing on CompCars [Yang_2015_CVPR] and Portraits [Ginosar_2015_ICCV_Workshops]. The first is a large-scale dataset composed of 136,726 vehicle photos taken over the space of 11 years (from 2004 to 2015). As in [adagraph], we select a subset of 24,151 images organized into 4 classes (type of vehicle: MPV, SUV, sedan and hatchback) and 30 domains obtained from the combination of the year of production (range between 2009 and 2014) and the perspective of the vehicle (5 different viewpoints). The second dataset is a large collection of pictures taken from American high school yearbooks. The photos cover a time range between 1905 and 2013 over 26 American states. Also in this case we follow [adagraph] for the experimental protocol: we define a gender classification task performed on 40 domains, obtained by choosing 8 decades (from 1934) and 5 regions (New England, Mid Atlantic, Mid West, Pacific and Southern).

    Domain Generalization

To align our PACS experiments with the training procedure used in [jigen], we apply random cropping, random horizontal flipping and photometric distortions, and resize the crops to 222×222 so that we get equally spaced square tiles on a 3×3 grid for the jigsaw puzzle task. We train the network for 40 epochs using SGD with momentum 0.9, an initial learning rate of 0.001, a cumulative batch size of 128 original plus 128 shuffled images, and weight decay. We divide the training inputs into 90% train and 10% validation splits, and test on the target with the model that performs best on the validation set. Denoting the auxiliary task loss weight by $\alpha$, we set it to the value that yields training convergence for the self-supervised task, and use that same value for all our experiments, including the DA and PDA settings, without further optimization. We also leave the hyperparameters of the one-sample finetuning steps fixed to their initial training values.
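For reference, the setup just described roughly corresponds to the configuration sketched below; the photometric distortion strengths and the weight decay value are assumptions, since their exact values are not reported here.

```python
# Sketch of the PACS-DG training configuration described in the text; the
# ColorJitter strengths and the weight decay value are assumptions.
import torch
from torchvision import models, transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(222),           # random crop, resized to 222x222
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),  # photometric distortions (assumed)
    transforms.ToTensor(),
])

model = models.resnet18()                        # ImageNet-pretrained in practice
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9,
                            weight_decay=5e-5)   # weight decay value assumed
```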

The obtained results are shown in Table 1, together with several useful baselines. In particular, JiGen [jigen] was the first method showing that self-supervision tasks can support domain generalization, while D-SAM [Antonio_GCPR18] and EPI-FCR [hospedales19] propose networks with domain-specific aggregation layers and domain-specific models respectively, with the latter also introducing a particular episodic training procedure and holding the current DG state of the art on PACS. DANN [ganin2014unsupervised] exploits a domain adversarial loss to obtain a source-invariant feature representation. MLDG [MLDG_AAA18] is a meta-learning based optimization method. We underline that all these baselines, with the notable exception of JiGen, need source data provided with both class and domain labels. On this basis, the advantage that GeOS shows with respect to EPI-FCR is even more significant. Since JiGen also leverages self-supervised knowledge, it might benefit from the One Sample Learning procedure at test time as in GeOS. For a fair comparison we used the code provided by the authors, implementing and running our Algorithm 1 on it. The row JiGen + OS reports the obtained results, showing a small advantage over the original JiGen and confirming the beneficial effect of the fine-tuning procedure. However, the gain remains limited with respect to the top result of GeOS: the flat multi-task architecture of JiGen implies a re-adaptation of the whole network, which might be out of reach with a single target sample. This confirms the effectiveness of the auxiliary learning structure chosen for GeOS.

PACS-DG                          art_paint.  cartoon  sketches  photo  Avg.
ResNet-18
[Antonio_GCPR18]  Deep All       77.87       75.89    69.27     95.19  79.55
                  D-SAM          77.33       72.43    77.83     95.30  80.72
[hospedales19]    Deep All       77.60       73.90    70.30     94.40  79.10
                  DANN           81.30       73.80    74.30     94.00  80.08
                  MLDG           79.50       77.30    71.50     94.30  80.70
                  EPI-FCR        82.10       77.00    73.00     93.90  81.50
[jigen]           Deep All       77.85       74.86    67.74     95.73  79.05
                  JiGen          79.40       75.25    71.35     96.03  80.51
                  JiGen + OS     79.40       75.24    72.26     96.27  80.79
                  GeOS           79.79       75.06    76.00     96.65  81.88
Table 1: Domain Generalization results on PACS. The GeOS results are averages over 3 repetitions of each run. Each column title indicates the domain used as target.
PACS-DG                          art_paint.  cartoon  sketches  photo  Avg.
ResNet-18
null hypothesis                  79.26       74.09    70.13     96.23  79.93
GeS (no test-time finetuning)    78.95       74.36    73.99     96.29  80.90
GeOS (1 iteration)               79.74       74.84    75.35     96.53  81.62
GeOS (more iterations)           79.79       75.01    76.00     96.61  81.85
GeOS (most iterations)           79.79       75.06    76.00     96.65  81.88
GeS (rotation task)              79.49       74.11    70.60     95.87  79.52
GeOS (rotation task)             78.19       74.81    71.63     95.79  80.11
Table 2: Analysis of several variants of GeOS: not using the auxiliary knowledge (null hypothesis), turning off the one-sample finetuning at test time (GeS), increasing the number of self-supervised iterations on the target sample (GeOS rows), and changing the self-supervised task from solving jigsaw puzzles to image rotation recognition.

    Analysis and Discussion

We provide a further in-depth analysis of the proposed method, starting from the results in Table 2. First of all, we trained the same network architecture of GeOS but without using the auxiliary self-supervised data: in this case we start from the same hyperparameter initialization used for GeOS, but we turn on the gradient propagation over the auxiliary network block, which then behaves as an extra residual layer of the main primary model. The row null hypothesis in the table indicates that the advantage of GeOS is not due to the increased depth and parameter count, but originates instead from the proper use of self-supervision and one-sample fine-tuning. To decouple these last two components, we turn off the one-sample learning procedure at test time: the obtained version GeS of our algorithm still outperforms JiGen and many of the other competitive methods in Table 1, which use more data annotation.

When the fine-tuning procedure on the test sample is on, it is possible to optimize the auxiliary network block with a different number of SGD iterations. The obtained results improve with the number of iterations, but are already remarkable with a single one. Finally, we evaluate the effectiveness of GeOS and of its simplified version GeS when changing the type of self-supervised knowledge used as auxiliary information. Specifically, we follow [gidaris2018unsupervised] and rotate the images in steps of 90°, training the auxiliary block to recognize the four orientations. In this case GeS does not provide any advantage with respect to the null hypothesis baseline. This reveals that the choice of the self-supervised task influences the generalization capabilities of our approach, but the possibility to still run fine-tuning on every single test sample maintains a beneficial effect.
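The rotation variant replaces the jigsaw transformation with a four-way orientation recognition task; a minimal sketch (hypothetical helper name):

```python
# Sketch of the rotation self-supervised task [gidaris2018unsupervised]:
# each image yields four variants rotated by 0, 90, 180 and 270 degrees,
# labeled 0..3 for the auxiliary classifier.
import torch

def make_rotation_variants(img):
    """img: (C, H, W) tensor. Returns (4, C, H, W) rotations and (4,) labels."""
    variants = [torch.rot90(img, k, dims=(1, 2)) for k in range(4)]
    return torch.stack(variants), torch.arange(4)
```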

PACS-DA                          art_paint.  cartoon  sketches  photo  Avg.
ResNet-18
[mancini2018boosting]  Deep All    74.70     72.40    60.10     92.90  75.03
                       Dial        87.30     85.50    66.80     97.00  84.15
                       DDiscovery  87.70     86.90    69.60     97.00  85.30
[jigen]                Deep All    77.85     74.86    67.74     95.73  79.05
                       JiGen       84.88     81.07    79.05     97.96  85.74
                       GeS         80.96     77.56    78.78     97.39  83.67
Table 3: Multi-source Domain Adaptation results on PACS, obtained as averages over 3 repetitions of each run.

    Unsupervised Domain Adaptation

Although designed for DG, our learning approach can also be used in the DA setting. To test its performance, we run experiments on PACS as already done in [jigen]. We choose the same training hyperparameters used in the DG experiments, with the difference that we train the self-supervised task using images from the unlabeled target domain only, and we validate the network on the self-supervised jigsaw puzzle task using a held-out split of the target. Since all the target data are now available at once, the one-sample finetuning strategy is superfluous, so we fall back to the simplified GeS version of our approach. Even though GeS only exploits the self-supervised knowledge, without any explicit domain adaptation strategy, the results in Table 3 show that it reduces the gap with the target domain, yielding an accuracy increase of more than 4 percentage points over the Deep All baseline.
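In code, the only change with respect to the DG training step sketched in Section 3 is the origin of the self-supervised batches; the loader names below are illustrative, and scramble_batch builds on the make_jigsaw_variants helper from before.

```python
# Sketch of the GeS training loop in the DA setting: the auxiliary jigsaw
# batches are built from unlabeled target images instead of source ones.
import torch

def scramble_batch(x, permutations):
    """One random permutation per image; returns scrambled batch and labels."""
    labels = torch.randint(len(permutations), (x.shape[0],))
    z = torch.stack([make_jigsaw_variants(img, [permutations[k]])[0][0]
                     for img, k in zip(x, labels.tolist())])
    return z, labels

def train_ges_da(backbone, primary_head, aux_block, aux_head,
                 opt_primary, opt_aux, source_loader, target_loader, permutations):
    for (x_s, y_s), x_t in zip(source_loader, target_loader):
        z_t, p_t = scramble_batch(x_t, permutations)   # target-only jigsaw task
        training_step(backbone, primary_head, aux_block, aux_head,
                      opt_primary, opt_aux, x_s, y_s, z_t, p_t)
```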

Both DDiscovery [mancini2018boosting] and Dial [carlucci2017auto] are methods that can be applied to the whole set of source samples without their domain labels, as can JiGen [jigen], so the comparison with GeS here is fair in terms of the data annotation involved. However, it is useful to remark that all those methods minimize an extra entropy loss on the target data. Although it might be beneficial for adaptation, this further learning condition is not applicable in the DG setting and introduces a further computational burden due to the need to tune the relative loss weight against the other losses already included in the training model. For a better understanding, we focus on JiGen and analyze its behaviour when changing the entropy loss weight. The performance, presented in Figure 4, clearly indicates that JiGen is fairly sensitive to this weight, besides having overall more ad-hoc hyperparameters than GeS and GeOS.

Figure 4: Analysis of JiGen in the PACS DA setting. The varied parameter weights the entropy loss that involves the target data. The method also exploits two different auxiliary loss weights related to the self-supervised task, besides a parameter used to regulate the data loading procedure for original and shuffled images.
ResNet-18
Method     CompCars   Portraits-Dec.   Portraits-Reg.
Baseline   56.8       82.3             89.2
AdaGraph   58.8       87.0             91.0
GeS        60.2       87.1             91.6
GeOS       60.0       87.1             91.5
Table 4: Predictive DA results.

    Predictive DA without metadata

The minimal supervision needed by GeOS puts it in a particularly profitable condition with respect to other existing DG methods in the challenging Predictive DA experimental setting. Indeed, GeOS can ignore the availability of metadata and directly exploit the large-scale unlabeled auxiliary sources. We compare the performance of our method against AdaGraph [adagraph], a very recent approach that exploits domain-specific batch-normalization layers to learn a model for each source domain in a graph, where the graph is built on the basis of the auxiliary source metadata.

We follow the experimental protocol described in [adagraph]. For CompCars, we select a pair of domains as source and target and use the remaining 28 as auxiliary unlabeled data. Considering all possible domain pairs, we get 870 experiments and report the average accuracy over all of them. A similar setting is applied to Portraits, for which we consider the across-decades scenario (source and target from the same region but different decades) and the across-regions scenario (source and target from the same decade but different regions). In total we run 440 experiments on Portraits.

In more detail, for CompCars we start from an ImageNet pretrained model and train for 6 epochs on the source domain using Adam as optimizer with weight decay. The batch size is 16, with separate learning rates for the classifier and for the rest of the network; the learning rate is decayed by a factor of 10 after 4 epochs. For Portraits the learning procedure remains the same, except that the number of epochs is 1 and the jigsaw weight is set to different values for the experiments across decades and across regions.

Table 4 shows the obtained results, indicating that GeS outperforms AdaGraph in all settings, despite using much less annotated information. In this particular setting, turning on the fine-tuning process on each target sample is irrelevant: the amount of auxiliary source data is so abundant that the self-supervised auxiliary task is already providing its best generalization effect, so GeOS does not show any further advantage with respect to GeS.

    5 Conclusions

This paper presented the first algorithm for domain generalization able to learn from target data at test time, as images are presented for classification. We do so by learning regularities of the target data through self-supervision, cast as an auxiliary task. The algorithm is very general and can be used with success in several settings, from classic domain adaptation to domain generalization, up to scenarios considering the possibility to access side domains [adagraph]. Moreover, the principled AL framework leads to a notable stability of the method with respect to the choice of its hyperparameters, a highly desirable feature for deployment in realistic settings. Future work will further investigate this new generalization scenario, studying the behaviour of the approach with respect to the amount and the quality of the unsupervised data available at training time.

    References

    Supplementary Material

We provide here an extended discussion and further evidence of the advantage introduced by learning to generalize one sample at a time through the proposed auxiliary self-supervised finetuning process. First of all, we clarify the difference between our full method, named GeOS, and its simplified version GeS.

GeS is the architecture we designed for deep learning Generalization by exploiting Self-Supervision. Its structure is depicted in Figure 3 of the paper. Besides the main network backbone that tackles the primary classification task, we introduce an auxiliary block that deals with the self-supervised objective. It provides useful complementary features that are finally recombined with those of the main network, improving the robustness of the primary model.

We mostly focused on the jigsaw puzzle self-supervised task, so our auxiliary data are scrambled versions of the original images, recomposed with their own patches in disordered positions. This specific formalization of the jigsaw puzzle was recently introduced in [jigen], where the method JiGen learns jointly over the ordered and the shuffled images with a flat multi-task architecture. Although effective, this approach substantially disregards the warnings raised in [NorooziF16] about the need to avoid shortcuts that exploit low-level image statistics rather than high-level semantic knowledge when solving a self-supervised task. By fitting the self-supervised knowledge extraction into an auxiliary learning framework, GeS keeps the beneficial effect provided by self-supervision without the risk of confusing the learning process with low-level, jigsaw-puzzle-specific information. Indeed, the auxiliary knowledge is extracted by a dedicated residual block towards the end of the network, with a tailored backpropagation strategy that keeps the primary and the auxiliary tasks synchronized but separated in their specific objectives.

In the DA setting, the auxiliary block of GeS is trained exclusively on the target images, which are available at training time but unlabeled. In this case, when the ordered source images enter the auxiliary block we obtain target-style-based features that help bridge the gap across domains. Indeed, these features are recombined with the ones from the main backbone and together guide the learning process of the primary classification model. In DG, the ordered source images are provided as input to the primary classification task, while their scrambled versions are fed to the auxiliary block. Thus the scrambled source images train the auxiliary block, which is finally used as a complementary feature extractor for the respective ordered images.

At test time in DA, each ordered target image passes through both the main and the auxiliary block for feature extraction, using the network model obtained at the end of the training phase. In DG, for each target sample we may follow the same procedure used for DA testing. However, we can also do more, by leveraging self-supervision to distill further knowledge.

target       run   GeS     GeOS (increasing iterations →)
photo        1     96.41   +0.12   +0.18   +0.30
             2     96.41   +0.30   +0.48   +0.36
             3     96.05   +0.30   +0.30   +0.42
art_paint.   1     79.00   +0.74   +1.13   +1.37
             2     78.71   +0.83   +0.33   +0.09
             3     79.15   +0.78   +0.64   +0.64
cartoon      1     73.72   +0.85   +0.67   +0.79
             2     74.23   +0.17   +0.34   +0.39
             3     75.13   +0.42   +0.55   +0.51
sketches     1     73.45   +1.53   +2.04   +2.04
             2     74.14   +1.35   +2.16   +1.81
             3     74.37   +1.20   +1.83   +2.19
Table 5: PACS-DG accuracy gains of GeOS over GeS when finetuning separately on each target sample at test time with an increasing number of iterations (the three GeOS columns correspond to growing iteration budgets).

GeOS is our full method: it exploits the architecture of GeS and runs a fine-tuning procedure at test time on each target sample in the DG setting. The target image is scrambled and provided as input to the auxiliary block, which is initialised with the model obtained from the scrambled source images at training time. Although we start from a single instance, standard data augmentation together with the scrambling procedure provides enough samples to fine-tune the auxiliary block. Of course, minimizing the jigsaw loss means running multiple SGD iterations for the network parameter updates.

Table 5 extends Table 2 of the paper, showing how subsequent iterations of the auxiliary block optimization process consistently introduce an improvement with respect to GeS. We executed three runs for each experiment, and the results indicate that the advantage is present in every single experiment, not just on average. We underline that, although the method JiGen [jigen] also exploits self-supervised knowledge for domain generalization, it does not consider the possibility of adapting the network at test time on each target sample. Indeed, its flat multi-task structure would imply an overall update of the network, while with GeOS we can adapt exclusively the auxiliary knowledge block, with a larger benefit on the obtained DG accuracy, as shown in Table 1 of the main submission.

2 Related Work

Figure 2: Overview of DA and DG most recent literature sorted on the basis of the amount of data annotation needed.

Self-Supervised Learning

SSL is a framework developed to learn visual features from large-scale unlabeled data [SSLsurvey]. Its first step is the choice of a pretext task that exploits inherent data attributes to automatically generate data labels. It has been shown that the semantic knowledge captured by the first layers of a network solving those tasks defines a useful initialization for new learning problems. Indeed the second SSL step consists in transferring the self-supervised learned model of those initial layers to a real downstream task (classification, detection), while the ending part of the network is newly trained. The advantage provided by the transferred model generally gets more evident, as the number of annotated samples of the downstream task is low.

The pretext tasks can be organized in three main groups. One group rely only on original visual cues and involves either the whole image with geometric transformations (translation, scaling, rotation [gidaris2018unsupervised, NIPS2014_geometric]), clustering [caron2018deep], inpainting [pathakCVPR16context]

and colorization

[zhang2016colorful], or considers image patches focusing on their equivariance (learning to count [learningtocount]) and relative position (solving jigsaw puzzles [NorooziF16, Noroozi_2018_CVPR]). A second group uses external sensory information either real or synthetic: this solution is often applied for multi-cue (visual-to-audio [audiovisual], RGB-to-depth [ren-cvpr2018]) and robotic data [grasp2vec, visiontouch]. Finally, the third group relies on video and on the regularities introduced by the temporal dimension [Wang_UnsupICCV2015, SSLvideo]. The most recent SSL research trends are mainly two. On one side there is the proposal of novel pretext tasks, compared on the basis of their ability to initialize a downstream task with respect to using supervised models as in standard transfer learning [OquabTL, YosinskiNIPS2014, FTmedical, SpotTune, Long:2015, liICML18]. On the other side there are new approaches to combine multiple pretext tasks together in multi-task settings [multitaskSSL, ren-cvpr2018].

Domain Adaptation and Generalization

To cope with domain shift, several algorithms have been developed mainly in two different settings. In DA the learning process has access to the labeled source data and to the unlabeled target data, thus the aim is to generalize to that specific target set [csurka_book]. The semi-supervised DA case considers also the availability of a limited number of annotated target samples [Saenko:2010, KulisSD11, doretto2017, semisup1, semisup2, ijcai2018, museumECCV18]. In DG the target is unknown at training time: the learning process can usually leverage on multiple sources to define a model robust to any new, previously unseen domain [shallowDG]. In both DA and DG, the main assumption is that source and target share the same label set, with very few works studying exceptions to this basic condition [PADA_eccv18, cocktail_CVPR18, Saito_2018_ECCV].

Feature-level strategies focus on learning domain invariant data representations mainly by minimizing different domain shift measures [Long:2015, LongZ0J17, dcoral, hdivergence]

. The domain shift can also be reduced by training a domain classifier and inverting the optimization to guide the features towards maximal domain confusion

[Ganin:DANN:JMLR16, Hoffman:Adda:CVPR17]. This adversarial approach has several variants, some of which also exploit class-specific domain recognition modules [saito2017maximum, Li_2018_ECCV]. Metric learning [doretto2017]

and deep autoencoders

[DGautoencoders, Li_2018_CVPR, Bousmalis:DSN:NIPS16] have also been used to search for domain-shared embedding spaces. In DG, these approaches leverage on the availability of multiple sources and on the access to the domain label for each sample, meaning that the identity of the source distribution from which every sample is drawn is strictly needed.

Model-level strategies either change how the data are loaded with ad-hoc episodes [hospedales19], or modify conventional learning algorithms to search for more robust minima of the objective function [MLDG_AAA18], or introduce domain alignment layers in standard learning networks [carlucci2017auto]. Those layers can also be used in multi-source DA to evaluate the relation between the sources and target and then perform source model weighting [MassiRAL, cocktail_CVPR18]. Several DG approaches aim at identifying and neglecting domain-specific signatures from multiple sources both through shallow and deep methods that exploit multi-task learning [ECCV12_Khosla], low-rank network parameter decomposition [hospedalesPACS, Ding2017DeepDG] or aggregation layers [Antonio_GCPR18, hospedales19]. In multi-source DA the domain label of the sources may be unknown [mancini2018boosting, hoffman_eccv12, carlucci2017auto], while for the DG it remains a crucial information that has to be provided since the beginning.

Finally, many recent methods adopt data-level solutions based on variants of the Generative Adversarial Networks (GANs, [Goodfellow:GAN:NIPS2014]) to synthesize new images. Indeed, it is possible to reduce the domain gap by producing source-like target images or/and target-like source images [russo17sbadagan, cycada], as well as a sequence of intermediate samples shifting from the source to the target [DLOW]. The data augmentation strategies in [DG_ICLR18, Volpi_2018_NIPS] learn how to properly perturb the source samples, even in the challenging case of DG from a single source. The combination of data- and feature-level strategies has also shown further improvements in results [ADAGE, sankaranarayanan2017generate].

Some recent works have started investigating intermediate settings between DA and DG. In Predictive DA a labeled source and several auxiliary unlabeled domains are available at training time together with metadata that describe their relation [adagraph, multivariatereg]. Other works propose approaches to push model-based DA solutions towards the DG setting adding a memory able to accumulate over multiple target samples at test time [adagraph, MassiRAL]. Although it is an interesting direction for online and continuous learning, it might only be seen as an upper limit condition to real DG in the wild, where we need a separate prediction for every sample. Moreover, [jigen] has recently started a new research direction moving SSL from the transfer learning to the domain generalization setting, showing that self-supervision provides useful auxiliary information to close the domain gap. In particular it showed that solving jigsaw puzzles improves the generalization properties of a supervised classification when both the models are jointly learned with a flat multi-task approach.

Multi-Task and Auxiliary Learning

MTL aims at simultaneously training over several tasks that mutually help each other [CaruanaMTL]. In deep learning this means searching for a single feature representation that works well for multiple tasks. This framework is at the basis of many CNN segmentation and detection algorithms [SimultaneousECCV14, FRCNN]. Several architectures have been investigated to better exploit inter-task connections and task-knowledge complementarity, while growing the number of combined tasks [Kokkinos_2017_CVPR, stitch, AZ_SS]

. Although powerful, MTL has one main drawback: it is sensitive to the weight assigned to each task, the choice of the scaling coefficient used to combine multiple loss weights. This causes the need for an extensive hyperparameter tuning

[Kokkinos_2017_CVPR] or for principled loss weighting strategies. Some recent approaches leverages on the evaluation of task uncertainty [fullyadaptive, kendall2017multi] and dynamically adjust the weights [gradnorm, Guo_2018_ECCV].

In many real applications the tasks are not all equally important and some prior knowledge on their ranking is available. In particular, the case with one main primary and several other auxiliary tasks is known as Auxiliary Learning (AL) and is related to the literature on learning with priviledged information [LUPI]. Very recently [NIPS2018ROCK] presented a residual strategy to integrate multi-modal auxiliary tasks and improve the performance of the primary object detection task. In [auxtasks] the main focus is in the choice of auxiliary tasks which should be as cheap as possible in terms of annotation and learning effort. This research direction is currently attracting more and more attention with also the introduction of unsupervised [aux2] and self-supervised [aux3] auxiliary tasks.

3 Generalize from One Sample

Figure 3: Schematic illustration of GeS architecture for learning with self-supervised auxiliary information. The primary network is trained on the supervised task. The auxiliary component refines features for the main classifier, while being trained to solve the Jigsaw Puzzle problem. Lines indicate feature paths in the network. A dotted line means gradients won’t be computed for the underlying layers.

The standard DG problem setting considers source domains, each with image-label pairs . The goal is learning a model that generalizes to any test sample drawn from a new target. The source domain index is needed by most of the existing DG algorithms, which use it to separate source-specific from source-generic knowledge. We choose instead to ignore it and deal directly with samples with where , focusing only on the class annotation . Moreover, by operating simple geometric transformations on , we can get a variety of new versions , with . Examples of transformations may be °rotations that lead to possible versions of each sample [gidaris2018unsupervised], or -patch based decomposition and shuffling as in a jigsaw puzzle, that leads to variants for each sample [jigen]. The obtained self-supervised data-label pairs where with , allow to define an auxiliary classification task that can be trained jointly with the primary one , improving its generalization effect across multiple sources.

Training Process

The general architecture of our model is shown in Figure 3. It is composed by a main convolutional backbone that extracts the features from the original images . It then provides them as input to the fully connected module of the primary task, that is in charge of computing the classification prediction. To this fairly general network, we add a new residual auxiliary block that deals with self-supervised data-label pairs. We focus on the jigsaw puzzle task, following the same approach used in [jigen]. In particular, the original images are decomposed through a regular grid in tiles which are then randomly re-assigned to one of the grid positions (Figure 3, bottom left). Out of all the possible permutations, we considered a set of cases, using the Hamming distance based algorithm in [NorooziF16]

. Thus, the auxiliary block takes as input the features extracted by the fully connected part of the main network from the scrambled images

. It then further process them through few extra convolutional layers, before entering the final fully connected auxiliary classification module that recognizes the puzzle permutation. We indicate with the features encoded by the auxiliary block (from the original images) that contribute back to the primary task representation. Indeed, the input to the fully connected module of the primary network is the element-wise sum .

We underline that, although the primary and the auxiliary tasks share the initial feature extraction process and present the described final feature recombination point, they are actually optimized independently. By indicating with the cross-entropy loss of the primary task and with the cross-entropy loss of the auxiliary jigsaw task, we overall train the network by optimizing the two following objectives:

(1)
(2)

To summarize it in words, the gradients of the auxiliary loss do not backpropagate into the primary network, and the gradients of the primary loss affect the auxiliary module only indirectly through the update of the initial convolutional part of the main network.

Testing Process and One Sample Learning

Given a test sample from an unknown target domain we extract both the primary and the auxiliary features from it to feed the classification model, get the prediction and check whether the assigned class is correct or not. With respect to this naïve testing process, the self-supervised nature of the auxiliary task gives us the possibility to further learn from the the single available test sample. Indeed we can always decompose the sample in patches to create its shuffled variants and further minimize the auxiliary puzzle classification loss. In this way the auxiliary block is fine-tuned on the single observed example and we can expect a benefit from recombining the auxiliary features with those of the primary model. The exact procedure of auxiliary learning from one sample at test time is described in Algorithm 1.

Data: source trained model, test sample
, source trained model
while still iterations do
       (, ) generate random self-supervised mini-batch from test sample variants
       minimize the loss
       update
      
end while
predict label of test sample using and
Algorithm 1 One Test Sample Learning

Implementation Details

We instantiate the main network backbone as a ResNet18 architecture and use a standard residual block as our auxiliary self-supervised module. Specifically, the auxiliary block implements a fully connected layer after the last convolution for self-supervised predictions. The main network, parametrized by

, is initialized with a pre-trained ImageNet model, while for the auxiliary block parametrized by

, we use random uniform weights. The output of the main network and that of the auxiliary block are aggregated with a plain element-wise sum. For each training iteration we feed the network with mini-batches of original and transformed images using batch accumulation to synchronously update and . Our architecture has a similar structure to the one recently presented in [NIPS2018ROCK], but we implemented a tailored backpropagation policy to keep separated the primary and the auxiliary learning process by zeroing the gradients at both the input and output ends of the auxiliary block.

4 Experiments

Datasets

The proposed GeOS algorithm is mainly designed to work in the DG setting with data from multiple sources, using only the sample category labels and ignoring the domain annotation. In other words, GeOS works with the same amount of data knowledge of the naïve Source Only reference, also known as Deep All since a basic CNN network can be trained on the overall aggregated source samples.

To test GeOS in the DG scenario we focused on the PACS dataset [hospedalesPACS] that contains approximately 10.000 images of 7 common categories across 4 different domains (Photo, Art painting, Cartoon, Sketch) characterized by large visual shifts. We further investigate the behaviour of GeOS in the multi-source DA setting with the same dataset, always considering one domain as target and the other three as sources.

Finally we evaluate GeOS in Predictive Domain Adaptation (PDA), a particular DG setting that has been recently put under the spotlight by [adagraph]. Here a single labeled data is available at training time together with a set of unlabeled auxiliary domains which are provided together with extra metadata (image timestamp, camera pose, ) useful to derive the reciprocal relation among the auxilary sets and the labeled source. For PDA we follow [adagraph], testing on CompCars [Yang_2015_CVPR] and Portraits [Ginosar_2015_ICCV_Workshops]. The first one is a large-scale dataset composed of 136,726 vehicle photos taken in the space of 11 years (from 2004 to 2015). As in [adagraph], we selected a subset of 24,151 images organized in 4 classes (type of vehicle: MPV, SUV, sedan and hatchback) and 30 domains obtained from the combination of the year of production (range between 2009 and 2014) and the perspective of the vehicle (5 different view points). The second dataset is a large collection of pictures taken from American high school year-books. The photos cover a time range between 1905 and 2013 over 26 American states. Also in this case we follow [adagraph] for the experimental protocol: we define a gender classification task performed on 40 domains obtained choosing 8 decades (from 1934) and 5 regions (New England, Mid Atlantic, Mid West, Pacific and Southern).

Domain Generalization

To align our PACS experiments with the training procedure used in [jigen], we apply random cropping, random horizontal flipping and photometric distortions, and resize crops to 222×222 so that we get equally spaced square tiles on a 3×3 grid for the jigsaw puzzle task. We train the network for 40 epochs using SGD with momentum 0.9, an initial learning rate of 0.001, a cumulative batch size of 128 original and 128 shuffled images, and weight decay. We divide the training inputs into 90% train and 10% validation splits, and test on the target with the model that performs best on the validation set. Denoting the auxiliary task loss weight by α, we choose the value at which the self-supervised task converges during training, and use it in all our experiments, including the DA and PDA settings, without further optimization. We also leave the hyperparameters for the one-sample finetuning steps fixed to their initial training values.
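
As an illustration of the tile decomposition above, the sketch below splits a 222×222 image into nine 74×74 tiles and reassembles them under a permutation; `scramble` is a hypothetical helper, and the random permutation set stands in for the Hamming-distance-selected set used in the jigsaw self-supervision literature.

```python
import random
import torch

def scramble(img: torch.Tensor, perm: list, grid: int = 3) -> torch.Tensor:
    """img: (C, H, W) with H, W divisible by grid; perm: permutation of range(grid*grid)."""
    c, h, w = img.shape
    th, tw = h // grid, w // grid  # 74x74 tiles for a 222x222 crop and a 3x3 grid
    tiles = [img[:, i*th:(i+1)*th, j*tw:(j+1)*tw]
             for i in range(grid) for j in range(grid)]
    shuffled = [tiles[p] for p in perm]
    rows = [torch.cat(shuffled[i*grid:(i+1)*grid], dim=2) for i in range(grid)]
    return torch.cat(rows, dim=1)

# usage: one shuffled variant whose permutation index serves as the auxiliary label
perms = [random.sample(range(9), 9) for _ in range(30)]  # placeholder permutation set
img = torch.rand(3, 222, 222)
label = random.randrange(len(perms))
shuffled_img = scramble(img, perms[label])
```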

The obtained results are shown in Table 1, together with several useful baselines. In particular, JiGen [jigen] was the first method showing that self-supervised tasks can support domain generalization, while D-SAM [Antonio_GCPR18] and EPI-FCR [hospedales19] propose networks with domain-specific aggregation layers and domain-specific models respectively, with the latter also introducing a particular episodic training procedure and holding the current DG state of the art on PACS. DANN [ganin2014unsupervised] exploits a domain adversarial loss to obtain a source-invariant feature representation. MLDG [MLDG_AAA18] is a meta-learning based optimization method. We underline that all these baselines, with the notable exception of JiGen, need source data provided with both class and domain labels. On this basis, the advantage that GeOS shows with respect to EPI-FCR is even more significant. Since JiGen also leverages self-supervised knowledge, it might benefit from the One Sample Learning procedure at test time as in GeOS. For a fair comparison we used the code provided by the authors, implementing and running our Algorithm 1 on top of it. The row JiGen + OS reports the obtained results, showing a small advantage over the original JiGen and confirming the beneficial effect of the fine-tuning procedure. However, the gain remains limited with respect to the top result of GeOS: the flat multi-task architecture of JiGen implies a re-adaptation of the whole network, which might be out of reach with a single target sample. This confirms the effectiveness of the auxiliary learning structure chosen for GeOS.

PACS-DG                        art_paint.  cartoon  sketches  photo   Avg.
Resnet-18
[Antonio_GCPR18]  Deep All     77.87       75.89    69.27     95.19   79.55
                  D-SAM        77.33       72.43    77.83     95.30   80.72
[hospedales19]    Deep All     77.60       73.90    70.30     94.40   79.10
                  DANN         81.30       73.80    74.30     94.00   80.08
                  MLDG         79.50       77.30    71.50     94.30   80.70
                  EPI-FCR      82.10       77.00    73.00     93.90   81.50
[jigen]           Deep All     77.85       74.86    67.74     95.73   79.05
                  JiGen        79.40       75.25    71.35     96.03   80.51
                  JiGen + OS   79.40       75.24    72.26     96.27   80.79
                  GeOS         79.79       75.06    76.00     96.65   81.88
Table 1: Domain Generalization results on PACS. The results of GeOS are averaged over 3 repetitions of each run. Each column title indicates the name of the domain used as target.
PACS-DG                        art_paint.  cartoon  sketches  photo   Avg.
Resnet-18
null hypothesis                79.26       74.09    70.13     96.23   79.93
GeS (jigsaw)                   78.95       74.36    73.99     96.29   80.90
GeOS (jigsaw, fewer iters.)    79.74       74.84    75.35     96.53   81.62
GeOS (jigsaw, more iters.)     79.79       75.01    76.00     96.61   81.85
GeOS (jigsaw, most iters.)     79.79       75.06    76.00     96.65   81.88
GeS (rotation)                 79.49       74.11    70.60     95.87   79.52
GeOS (rotation)                78.19       74.81    71.63     95.79   80.11
Table 2: Analysis of several variants of GeOS: not using the auxiliary knowledge (null hypothesis), turning off the one-sample finetuning at test time (GeS), increasing the number of self-supervised iterations on the target sample (GeOS rows, ordered by increasing iterations), and changing the self-supervised task from solving jigsaw puzzles to image rotation recognition.

Analysis and Discussion

We provide a further in-depth analysis of the proposed method, starting from the results in Table 2. First of all, we trained the same network architecture as GeOS but without using the auxiliary self-supervised data: in this case we start from the same hyperparameter initialization used for GeOS, but we turn on gradient propagation through the auxiliary network block, which now behaves as an extra residual layer for the primary model. The row null hypothesis in the table indicates that the advantage of GeOS is not due to the increased depth and parameter count, but originates instead from the proper use of self-supervision and one-sample fine-tuning. To decouple these last two components, we turn off the one-sample learning procedure at test time: the resulting version GeS of our algorithm still outperforms JiGen and many of the other competitive methods in Table 1, which nevertheless use more data annotation.

When the fine-tuning procedure on the test sample is on, the auxiliary network block can be optimized with a varying number of SGD iterations. The results improve as the number of iterations grows, but are already remarkable with a single one. Finally, we evaluate the effectiveness of GeOS and its simplified version GeS when changing the type of self-supervised knowledge used as auxiliary information. Specifically, we follow [gidaris2018unsupervised] and rotate the images in steps of 90°, training the auxiliary block to recognize the four orientations. In this case GeS does not provide any advantage with respect to the null hypothesis baseline. This reveals that the choice of the self-supervised task influences the generalization capabilities of our approach, but the possibility to still run fine-tuning on every single test sample maintains a beneficial effect.
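
A sketch of this rotation-based auxiliary data generation follows, assuming square inputs (so that rotated copies share the same shape); `rotation_batch` is a hypothetical helper name.

```python
import torch

def rotation_batch(img: torch.Tensor):
    """img: (C, H, W) with H == W -> (4, C, H, W) rotated copies and orientation labels 0-3."""
    rots = [torch.rot90(img, k, dims=(1, 2)) for k in range(4)]  # 0/90/180/270 degrees
    return torch.stack(rots), torch.tensor([0, 1, 2, 3])
```

The auxiliary block is then trained with a 4-way cross-entropy on these orientation labels instead of the permutation labels of the jigsaw task.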

PACS-DA art_paint. cartoon sketches photo Avg.
Resnet-18
[mancini2018boosting] Deep All 74.70 72.40 60.10 92.90 75.03
Dial 87.30 85.50 66.80 97.00 84.15
DDiscovery 87.70 86.90 69.60 97.00 85.30
[jigen] Deep All 77.85 74.86 67.74 95.73 79.05
JiGen 84.88 81.07 79.05 97.96 85.74
GeS 80.96 77.56 78.78 97.39 83.67
Table 3: Multi-source Domain Adaptation results on PACS, averaged over 3 repetitions of each run.

Unsupervised Domain Adaptation

Although designed for DG, our learning approach can also be used in the DA setting. To test its performance, we run experiments on PACS as already done in [jigen]. We keep the same training hyperparameters used in the DG experiments, with the difference that we train the self-supervised task using images from the unlabeled target domain only, and we validate the network on the self-supervised jigsaw puzzle task using a held-out split from the target. Since all the target data are now available at once, the one-sample finetuning strategy is superfluous, so we fall back to the simplified GeS version of our approach. Even though GeS only exploits self-supervised knowledge, without any explicit domain adaptation strategy, the results in Table 3 show that it reduces the gap with the target domain, yielding an accuracy increase of more than 4 percentage points over the Deep All baseline.
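
The sketch below shows one GeS training step in this DA setting; `main_net`, `aux_block`, the optimizers and the batches are hypothetical stand-ins consistent with the earlier sketches. The primary loss only ever sees ordered, labeled source images, while the jigsaw loss only sees scrambled, unlabeled target images.

```python
import torch
import torch.nn.functional as F

def ges_da_step(main_net, aux_block, opt_main, opt_aux,
                source_imgs, source_labels, target_scrambled, perm_labels):
    # primary task on ordered source images; the auxiliary features are detached,
    # so the classification loss cannot update the auxiliary block
    f = main_net.features(source_imgs)
    logits = main_net.classifier(f + aux_block.features(f).detach())
    loss_main = F.cross_entropy(logits, source_labels)
    opt_main.zero_grad(); loss_main.backward(); opt_main.step()

    # auxiliary jigsaw task on scrambled target images; backbone features are
    # computed without gradient, so the backbone is untouched by this loss
    with torch.no_grad():
        f_t = main_net.features(target_scrambled)
    loss_aux = F.cross_entropy(aux_block(f_t), perm_labels)
    opt_aux.zero_grad(); loss_aux.backward(); opt_aux.step()
```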

Both DDiscovery [mancini2018boosting] and Dial [carlucci2017auto], like JiGen [jigen], can be applied to the whole set of source samples without their domain labels, so the comparison with GeS is fair here in terms of the data annotation involved. However, it is useful to remark that all those methods minimize an extra entropy loss on the target data. Although it may be beneficial for adaptation, this further learning condition is not applicable in the DG setting and introduces an additional computational burden: its relative loss weight must be tuned to balance its relevance against the other losses already included in the training model. For a better understanding, we focus on JiGen and analyze its behaviour when changing the entropy loss weight η. The performance shown in Figure 4 clearly indicates that JiGen is fairly sensitive to η, besides having overall more ad-hoc hyperparameters than GeS and GeOS.

Figure 4: Analysis of JiGen in the PACS DA setting. The parameter η weights the entropy loss that involves the target data. Moreover, the method exploits two different auxiliary loss weights (α_s, α_t) related to the self-supervised task, besides the parameter β used to regulate the data loading procedure for original and shuffled images.
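
For completeness, here is a minimal sketch of the target entropy term that these DA methods weight with such a hyperparameter; this is the standard formulation, not code from any of the cited implementations.

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the softmax predictions on unlabeled target images."""
    p = F.softmax(logits, dim=1)
    return -(p * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```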
Resnet-18
Method CompCars Portraits-Dec. Portraits-Reg.
Baseline 56.8 82.3 89.2
AdaGraph 58.8 87.0 91.0
GeS 60.2 87.1 91.6
GeOS 60.0 87.1 91.5
Table 4: Predictive DA results.

Predictive DA without metadata

The minimal supervision required by GeOS puts it in a particularly favourable position with respect to other existing DG methods in the challenging Predictive DA experimental setting. Indeed, GeOS can ignore the availability of metadata and directly exploit the large-scale unlabeled auxiliary sources. We compare the performance of our method against AdaGraph [adagraph], a very recent approach that exploits domain-specific batch-normalization layers to learn a model for each source domain in a graph, where the graph is built on the basis of the source auxiliary metadata.

We follow the experimental protocol described in [adagraph]. For CompCars, we select a pair of domains as source and target and use the remaining 28 as auxiliary unlabeled data. Considering all possible ordered domain pairs, we get 870 experiments and report the accuracy averaged over all of them. A similar setting is applied to Portraits, for which we consider the across-decades scenario (source and target from the same region but different decades) and the across-regions scenario (source and target from the same decade but different regions). In total we run 440 experiments on Portraits.
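
As a quick sanity check of these counts, assuming ordered source/target pairs (which the totals below confirm):

```python
from itertools import permutations

compcars = len(list(permutations(range(30), 2)))           # 30 domains -> 870 ordered pairs
across_decades = 5 * len(list(permutations(range(8), 2)))  # same region, 8 decades -> 280
across_regions = 8 * len(list(permutations(range(5), 2)))  # same decade, 5 regions -> 160
assert compcars == 870 and across_decades + across_regions == 440
```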

In more detail, for CompCars we start from an ImageNet pretrained model and train for 6 epochs on the source domain using Adam as optimizer with weight decay. The batch size is 16, with separate learning rates for the classifier and for the rest of the network; the learning rates are decayed by a factor of 10 after 4 epochs. In the case of Portraits the main learning procedure remains the same, except that the number of epochs is 1 and the jigsaw weight is set to different values for the across-decades and across-regions experiments.

Table 4 shows the obtained results, indicating that GeS outperforms AdaGraph in all settings despite using much less annotated information. In this particular setting, turning on the fine-tuning process on each target sample is irrelevant: the auxiliary source data are so abundant that the self-supervised auxiliary task already provides its best generalization effect, so GeOS shows no further advantage with respect to GeS.

5 Conclusions

This paper presented the first algorithm for domain generalization able to learn from target data at test time, as images are presented for classification. We do so by learning regularities about the target data through self-supervision, cast as an auxiliary task. The algorithm is very general and can be used successfully in several settings, from classic domain adaptation to domain generalization, up to scenarios that consider access to side domains [adagraph]. Moreover, the principled auxiliary learning framework leads to a notable stability of the method with respect to the choice of its hyperparameters, a highly desirable feature for deployment in realistic settings. Future work will further investigate this new generalization scenario, studying the behaviour of the approach with respect to the amount and quality of unsupervised data available at training time.

References

Supplementary Material

We provide here an extended discussion and further evidence of the advantage introduced by learning to generalize one sample at a time through the proposed auxiliary self-supervised finetuning process. First of all, we clarify the difference between our full method, named GeOS, and its simplified version, GeS.

GeS is the architecture we designed for deep learning Generalization by exploiting Self-supervision. Its structure is depicted in Figure 3 of the paper. Besides the main network backbone that tackles the primary classification task, we introduce an auxiliary block that deals with the self-supervised objective. It provides useful complementary features that are finally recombined with those of the main network, improving the robustness of the primary model.

We mostly focus on the jigsaw puzzle self-supervised task, so our auxiliary data are scrambled versions of the original images, recomposed with their own patches in disordered positions. This specific jigsaw puzzle formulation was recently introduced in [jigen], where the method JiGen learns jointly over the ordered and the shuffled images with a flat multi-task architecture. Although shown to be effective, this approach substantially disregards the warnings highlighted in [NorooziF16] about the need to avoid shortcuts that exploit low-level image statistics rather than high-level semantic knowledge while solving a self-supervised task. By fitting the self-supervised knowledge extraction into an auxiliary learning framework, GeS keeps the beneficial effect of self-supervision without the risk of confusing the learning process with low-level, jigsaw-puzzle-specific information. Indeed, the auxiliary knowledge is extracted by a dedicated residual block towards the end of the network, together with a tailored backpropagation strategy that keeps the primary and the auxiliary tasks synchronized but separated in their specific objectives.

In the DA setting, the auxiliary block of GeS is trained exclusively on the target images, which are available at training time but unlabeled. In this case, when the ordered source images enter the auxiliary block we obtain target-style features that help bridge the gap across domains. Indeed, these features are recombined with the ones from the main backbone and together guide the learning process of the primary classification model. In DG, the ordered source images are provided as input to the primary classification task, while their scrambled versions are fed to the auxiliary block. Thus the scrambled source images train the auxiliary block, which is finally used as a complementary feature extractor for the respective ordered images.

At test time, in DA each ordered target image passes both through the main network and through the auxiliary block for feature extraction, using the model obtained at the end of the training phase. In DG, for each target sample we may follow the same procedure used for DA testing. However, we can also do more by leveraging self-supervision to distill further knowledge.

 target       run   GeS     GeOS (increasing finetuning iterations)
 photo        1     96.41   +0.12   +0.18   +0.30
              2     96.41   +0.30   +0.48   +0.36
              3     96.05   +0.30   +0.30   +0.42
 art_paint.   1     79.00   +0.74   +1.13   +1.37
              2     78.71   +0.83   +0.33   +0.09
              3     79.15   +0.78   +0.64   +0.64
 cartoon      1     73.72   +0.85   +0.67   +0.79
              2     74.23   +0.17   +0.34   +0.39
              3     75.13   +0.42   +0.55   +0.51
 sketches     1     73.45   +1.53   +2.04   +2.04
              2     74.14   +1.35   +2.16   +1.81
              3     74.37   +1.20   +1.83   +2.19
Table 5: PACS-DG accuracy gains of GeOS over GeS when finetuning separately on each target sample at test time with an increasing number of iterations.

GeOS is our full method: it exploits the architecture of GeS and runs a fine-tuning procedure at test time on each target sample in the DG setting. The target image is scrambled and provided as input to the auxiliary block, which is initialised with the model obtained from the scrambled source images at training time. Although starting from a single instance, standard data augmentation together with the scrambling procedure provides enough samples to fine-tune the auxiliary block. Of course, minimizing the jigsaw loss means running multiple SGD iterations for the network parameter updates.

Table 5 extends Table 2 of the paper, showing how subsequent iterations of the auxiliary block optimization always introduce an improvement with respect to GeS. We executed three runs for each experiment, and the results indicate that the advantage is present in every single experiment, not just on average. We underline that, although JiGen [jigen] also exploits self-supervised knowledge for domain generalization, it does not consider the possibility of adapting the network at test time on each target sample. Indeed, its flat multi-task structure would imply an overall update of the network, while with GeOS we can focus on adapting exclusively the auxiliary knowledge block, with a larger benefit on the obtained DG accuracy, as shown in Table 1 of the main submission.

Testing Process and One Sample Learning

Given a test sample from an unknown target domain we extract both the primary and the auxiliary features from it to feed the classification model, get the prediction and check whether the assigned class is correct or not. With respect to this naïve testing process, the self-supervised nature of the auxiliary task gives us the possibility to further learn from the the single available test sample. Indeed we can always decompose the sample in patches to create its shuffled variants and further minimize the auxiliary puzzle classification loss. In this way the auxiliary block is fine-tuned on the single observed example and we can expect a benefit from recombining the auxiliary features with those of the primary model. The exact procedure of auxiliary learning from one sample at test time is described in Algorithm 1.

Data: source trained model, test sample
, source trained model
while still iterations do
       (, ) generate random self-supervised mini-batch from test sample variants
       minimize the loss
       update
      
end while
predict label of test sample using and
Algorithm 1 One Test Sample Learning

Implementation Details

We instantiate the main network backbone as a ResNet18 architecture and use a standard residual block as our auxiliary self-supervised module. Specifically, the auxiliary block implements a fully connected layer after the last convolution for self-supervised predictions. The main network, parametrized by

, is initialized with a pre-trained ImageNet model, while for the auxiliary block parametrized by

, we use random uniform weights. The output of the main network and that of the auxiliary block are aggregated with a plain element-wise sum. For each training iteration we feed the network with mini-batches of original and transformed images using batch accumulation to synchronously update and . Our architecture has a similar structure to the one recently presented in [NIPS2018ROCK], but we implemented a tailored backpropagation policy to keep separated the primary and the auxiliary learning process by zeroing the gradients at both the input and output ends of the auxiliary block.

4 Experiments

Datasets

The proposed GeOS algorithm is mainly designed to work in the DG setting with data from multiple sources, using only the sample category labels and ignoring the domain annotation. In other words, GeOS works with the same amount of data knowledge of the naïve Source Only reference, also known as Deep All since a basic CNN network can be trained on the overall aggregated source samples.

To test GeOS in the DG scenario we focused on the PACS dataset [hospedalesPACS] that contains approximately 10.000 images of 7 common categories across 4 different domains (Photo, Art painting, Cartoon, Sketch) characterized by large visual shifts. We further investigate the behaviour of GeOS in the multi-source DA setting with the same dataset, always considering one domain as target and the other three as sources.

Finally we evaluate GeOS in Predictive Domain Adaptation (PDA), a particular DG setting that has been recently put under the spotlight by [adagraph]. Here a single labeled data is available at training time together with a set of unlabeled auxiliary domains which are provided together with extra metadata (image timestamp, camera pose, ) useful to derive the reciprocal relation among the auxilary sets and the labeled source. For PDA we follow [adagraph], testing on CompCars [Yang_2015_CVPR] and Portraits [Ginosar_2015_ICCV_Workshops]. The first one is a large-scale dataset composed of 136,726 vehicle photos taken in the space of 11 years (from 2004 to 2015). As in [adagraph], we selected a subset of 24,151 images organized in 4 classes (type of vehicle: MPV, SUV, sedan and hatchback) and 30 domains obtained from the combination of the year of production (range between 2009 and 2014) and the perspective of the vehicle (5 different view points). The second dataset is a large collection of pictures taken from American high school year-books. The photos cover a time range between 1905 and 2013 over 26 American states. Also in this case we follow [adagraph] for the experimental protocol: we define a gender classification task performed on 40 domains obtained choosing 8 decades (from 1934) and 5 regions (New England, Mid Atlantic, Mid West, Pacific and Southern).

Domain Generalization

To align our PACS experiments with the training procedure used in [jigen], we apply random cropping, random horizontal flipping, photometric distortions and resize crops to 222222 so that we get equally spaced square tiles on a 3

3 grid for the jigsaw puzzle task. We train the network for 40 epochs using SGD with momentum set at 0.9, an initial learning rate of 0.001, a cumulative batch size of 128 original images and 128 shuffled images and a weight decay of

. We divide train inputs in 90% train and 10% validation splits, and test on the target with the best performing model on the validation set. By indicating the auxiliary task loss weight with , we achieve the training convergence for the self-supervised task by assigning , and use that value for all our experiments, including DA and PDA settings, without further optimization. We also leave hyperparameters for the one-sample finetuting steps fixed to their initial training values.

The obtained results are shown in Table 1, together with several useful baselines. In particular, JiGen [jigen] was the first method showing that self-supervision tasks can support domain generalization, while D-SAM [Antonio_GCPR18] and EPI-FCR [hospedales19] propose networks with domain specific aggregation layers and domain specific models respectively, with the second one introducing also a particular episodic training procedure and getting the current DG state of the art on PACS. DANN [ganin2014unsupervised] exploits a domain adversarial loss to obtain a source invariant feature representation. MLDG [MLDG_AAA18] is a meta-learning based optimization method. We underline that all these baseline, with the notable exception of JiGen, need source data provided with both class and domain label. On this basis, the advantage that GeOS shows with respect to EPI-FCR is even more significant. Since also JiGen leverages over self-supervised knowledge, it might benefit of the One Sample Learning procedure at test time as in GeOS. For a fair comparison we used the code provided by the authors, implementing and running on it our Algorithm 1. The row JiGen + OS reports the obtained results, showing a small advantage over the original JiGen, confirming the beneficial effect of the fine tuning procedure. However the gain is still limited with respect to the top result of GeOS: the flat multi-task architecture of JiGen implies a re-adaptation of the whole network which might be out of reach with a single target sample. This confirms the effectiveness of the chosen auxiliary learning structure chosen for GeOS.

PACS-DG art_paint. cartoon sketches photo Avg.
Resnet-18
[Antonio_GCPR18] Deep All 77.87 75.89 69.27 95.19 79.55
D-SAM 77.33 72.43 77.83 95.30 80.72
[hospedales19] Deep All 77.60 73.90 70.30 94.40 79.10
DANN 81.30 73.80 74.30 94.00 80.08
MLDG 79.50 77.30 71.50 94.30 80.70
EPI-FCR 82.10 77.00 73.00 93.90 81.50
[jigen] Deep All 77.85 74.86 67.74 95.73 79.05
JiGen 79.4 75.25 71.35 96.03 80.51
JiGen + OS 79.40 75.24 72.26 96.27 80.79
GeOS 79.79 75.06 76 96.65 81.88
Table 1: Domain Generalization results on PACS. The results of GeOS are average over 3 repetitions of each run. Each column title indicates the name of the domain used as target.
PACS-DG art_paint. cartoon sketches photo Avg.
Resnet-18
null hypothesis 79.26 74.09 70.13 96.23 79.93
GeS 78.95 74.36 73.99 96.29 80.90
79.74 74.84 75.35 96.53 81.62
79.79 75.01 76 96.61 81.85
79.79 75.06 76 96.65 81.88
79.49 74.11 70.6 95.87 79.52
78.19 74.81 71.63 95.79 80.11
Table 2: Analysis of several variants of GeOS: not using the auxiliary knowledge, turning off the one sample finetuning at test time, increasing the number of self-supervised iterations on the target sample and also changing the self-supervised task from solving jigsaw puzzles to image rotation recognition.

Analysis and Discussion

We provide a further in-depth analysis of the proposed method, starting from the results in Table 2. First of all we trained the same network architecture of GeOS but without using the auxiliary self-superivised data: in this case we start from the same hyperparameter initialization setting used for GeOS but we turn on the gradient propagation over the auxiliary network block which now behaves as an extra residual layer for the main primary model. The row null hypothesis in the table indicates that the advantage of GeOS is not due to the increased depth and parameter count, but originates instead from the proper use of self-supervision and one sample fine tuning. To even decouple these last two components, we turn off the one sample learning procedure at test time: the obtained version GeS of our algorithm still outperform JiGen and many of the other competitive methods in Table 1, that yet use more data annotation.

When the fine tuning procedure on the test sample is on, it is possible to optimize the auxiliary network block with a different number of SGD iterations. We show that the obtained results increase with the number of iterations, but are already remarkable with a single one. Finally we evaluate the effectiveness of GeOS and its simplified version GeS when changing the type of self-supervised knowledge used as auxiliary information. Precisely we follow [gidaris2018unsupervised] and rotate the images at steps of 90°, training the auxiliary block for recognition among the four orientations. In this case GeS does not provide any advantage with respect to the null hypothesis baseline. This reveals that the choice of the self-supervised task influences the generalization capabilities of our approach, but the possibility to still run fine tuning on every single test sample maintains a beneficial effect.

PACS-DA art_paint. cartoon sketches photo Avg.
Resnet-18
[mancini2018boosting] Deep All 74.70 72.40 60.10 92.90 75.03
Dial 87.30 85.50 66.80 97.00 84.15
DDiscovery 87.70 86.90 69.60 97.00 85.30
[jigen] Deep All 77.85 74.86 67.74 95.73 79.05
JiGen 84.88 81.07 79.05 97.96 85.74
GeS 80.96 77.56 78.78 97.39 83.67
Table 3: Multi-source Domain Adaptation results on PACS obtained as average over 3 repetitions for each run.

Unsupervised Domain Adaptation

Although designed for DG, our learning approach can also be used in the DA setting. To test its performance, we run experiments on PACS as already done in [jigen]. We choose the same training hyperparameters used in DG experiments, with the difference that we train the self-supervised task using images from the target unlabeled domain only, and we validate the network on the self-supervised jigsaw puzzle task using an held-out split from the target. Since all the target data are now available at once, the one sample finetuning strategy is superfluous, thus we fall back to the simplified GeS version of our approach. Even just exploiting the self-supervised knowledge and not using any explicit domain adaptation strategy, results in Table 3 show that GeS reduces the domain gap with the target domain, yielding an accuracy increase of more than 4 percentage points over the Deep All baseline.

Both DDiscovery [mancini2018boosting] and Dial [carlucci2017auto] are methods that can be applied on the whole set of source samples without their domain label, as well as JiGen [jigen], thus the comparison with GeS here is fair in terms of data annotation involved. However, it is useful to remark that all those methods minimize an extra entropy loss on the target data. Although it might be beneficial for adaptation, this further learning condition is not applicable in the DG setting and introduces a further computational burden due to the need of tuning the relative loss weight to adjust its relevance with respect to the other losses already included in the training model. For a better understanding, we focus on JiGen and analyze its behaviour when changing the entropy loss weight . The obtained performance is presented in Figure 4 and clearly indicate that JiGen is fairly sensitive to , besides having overall more ad-hoc hyperparameters that GeS and GeOS.

Figure 4: Analysis of JiGen in the PACS DA setting. The parameter weights the entropy loss that involves the target data. Moreover the method exploits two different () auxiliary loss weights related to the self-supervised task, besides the parameter used to regulate the data loading procedure for original and shuffled images.
Resnet-18
Method CompCars Portraits-Dec. Portraits-Reg.
Baseline 56.8 82.3 89.2
AdaGraph 58.8 87.0 91.0
GeS 60.2 87.1 91.6
GeOS 60.0 87.1 91.5
Table 4: Predictive DA results.

Predictive DA without metadata

The minimal need of supervision of GeOS puts it in a particularly profitable condition with respect to other existing DG methods in the challenging Predictive DA experimental setting. Indeed GeOS can ignore the availability of metadata and exploit directly the large scale unlabeled auxiliary sources. We compare the performance of our method against AdaGraph [adagraph]

, a very recent approach that exploits domain-specific batch-normalization layers to learn models for each source domain in a graph, where the graph is provided on the basis of the source auxiliary metadata.

We follow the experimental protocol described in [adagraph]. For CompCars, we select a pair of domains as source and target and use the remaining 28 as auxiliary unlabeled data. Considering all possible domain pairs, we get 870 experiments and observe the average accuracy results over all of them. A similar setting is applied for Portraits, for which we consider the across decades scenario (source and target domains selected from the same decade) and the across region scenario (source and target from the same region). In total we run 440 experiments for Portraits.

More in details, for CompCars, we start from an ImageNet pretrained model and trained for 6 epochs on source domain using Adam as optimizer with weight decay of . The batch size used is 16 and the learning rate is for the classifier and for the rest of the network; the learning rate is decayed by a factor of 10 after 4 epochs. In the case of Portraits the main learning procedure remains the same used above, except for the number of epochs that in this case is 1 and for the jigsaw weight that in this case was set to for the experiments across decades and to for the experiments across regions.

Table 4 show the obtained results, indicating that GeS outperforms AdaGraph in all settings, despite using much less annotated information. In this particular setting, turning on the fine tuning process on a each target sample is irrelevant: the amount of auxiliary source data is so abundant that the self-supervised auxiliary task is already providing its best generalization effect, thus GeOS does not show any further advantage with respect to GeS.

5 Conclusions

This paper presented the first algorithm for domain generalization able to learn from target data at test time, as images are presented for classification. We do so by learning regularities about target data as auxiliary task through self-supervision. The algorithm is very general and can be used with success in several settings, from classic domain adaptation to domain generalization, up to scenarios considering the possibility to access side domains [adagraph]. Moreover, the principled AL framework leads to a notable stability of the method with respect to the choice of its hyperparameters, a highly desirable feature from deployment in realistic settings. Future work will further investigate this new generalization scenario, studying the behaviour of the approach with respect to the amount and the quality of unsupervised data available at training time.

References

Supplementary Material

We provide here an extended discussion and further evidences of the advantage introduced by learning to generalize one sample at a time through the proposed auxiliary self-supervised finetuning process. First of all we clarify the difference between our full method named GeOS and its simplified version GeS.

GeS is the architecture we designed for deep learning Generalization by exploiting Self-Supervision. Its structure is depicted in Figure 3 of the paper. Besides the main network backbone that tackles the primary classification task, we introduce an auxiliary block that deals with the self-supervised objective. It provides useful complementary features that are finally recombined with those of the main network improving the robustness of the primary model.

We mostly focused on the jigsaw puzzle self-supervised task, thus our auxiliary data are scrambled version of the original images, recomposed with their own patches in disordered positions. This specific formalization for jigsaw puzzle was recently introduced in [jigen], where the method JiGen learns jointly over the ordered and the shuffled images with a flat multi-task architecture. Although it showed to be effective, this approach substantially disregards the warnings highlighted in [NorooziF16] about the need of avoiding shortcuts that exploit low-level image statistics rather than high-level semantic knowledge while solving a self-supervised task. By fitting the self-supervised knowledge extraction in an auxiliary learning framework, GeS keeps the beneficial effect provided by self-supervision without the risk of confusing the learning process with low-level jigsaw puzzle specific information. Indeed the auxiliary knowledge is extracted by a dedicated residual block towards the end of the network together with a tailored backpropagation strategy that keeps the primary and the auxiliary tasks synchronized but separated in their specific objectives.

In the DA setting, the auxiliary block of GeS is trained exclusively by the target images, that are available at training time but are unlabeled. In this case when the ordered source images enter the auxiliary block we obtain target-style-based features that allow to bridge the gap across domains. Indeed, these features are recombined with the ones from the main backbone and together guide the learning process of the primary classification model. In DG, the ordered source images are provided as input to the primary classification task, while their scrambled versions are fed to the auxiliary block. Thus the scrambled source images train the auxiliary block, that is finally used as a complementary feature extractor for the respective ordered images.

At test time, for DA each of the ordered target images pass both through the main and through the auxiliary block for feature extraction using the network model obtained at the end of the training phase. In DG, for each target sample we may follow the same procedure used for DA testing. However, we can also do more by leveraging on self-supervision to distill further knowledge.

 target run GeS GeOS GeOS GeOS
 photo 1 96.41 +0.12 +0.18 +0.30
2 96.41 +0.30 +0.48 +0.36
3 96.05 +0.30 +0.30 +0.42
 art_paint. 1 79.00 +0.74 +1.13 +1.37
2 78.71 +0.83 +0.33 +0.09
3 79.15 +0.78 +0.64 +0.64
 cartoon 1 73.72 +0.85 +0.67 +0.79
2 74.23 +0.17 +0.34 +0.39
3 75.13 +0,42 +0.55 +0.51
 sketches 1 73.45 +1.53 +2.04 +2.04
2 74.14 +1.35 +2.16 +1.81
3 74.37 +1.20 +1.83 +2.19
Table 5: PACS-DG accuracy gains of GeOS over GeS when finetuning separately over each target sample at test time with an increasing number of iterations.

GeOS is our full method that exploits the architecture of GeS and runs a fine-tuning procedure at test time for each target sample in the DG setting. The target image is scrambled and provided as input to the auxiliary block which is initialised with the model obtained from the scrambled source images at training time. Although starting from a single instance, the standard data augmentation together with the scrambling procedure provide us with enough samples to fine-tune the auxiliary block. Of course minimizing the jigsaw loss means running multiple SGD iterations for the network parameter updates.

Table 5 extends Table 2 of the paper, showing how subsequent iterations of the auxiliary block optimization process always introduces an improvement with respect to GeS. We executed three runs for each experiment and the results indicate that the advantage is always present in each single experiment and it is not just an effect visible on average. We underline that, although also the method JiGen [jigen] exploits self-supervised knowledge for domain generalization, it does not consider the possibility to adapt the network at test time on each target sample. Indeed, its flat multi-task structure would imply an overall update of the network, while with GeOS we can focus on adapting exclusively the auxiliary knowledge block with a larger benefit on the obtained DG accuracy as shown in Table 1 of the main submission.

4 Experiments

Datasets

The proposed GeOS algorithm is mainly designed to work in the DG setting with data from multiple sources, using only the sample category labels and ignoring the domain annotation. In other words, GeOS works with the same amount of data knowledge of the naïve Source Only reference, also known as Deep All since a basic CNN network can be trained on the overall aggregated source samples.

To test GeOS in the DG scenario we focused on the PACS dataset [hospedalesPACS] that contains approximately 10.000 images of 7 common categories across 4 different domains (Photo, Art painting, Cartoon, Sketch) characterized by large visual shifts. We further investigate the behaviour of GeOS in the multi-source DA setting with the same dataset, always considering one domain as target and the other three as sources.

Finally we evaluate GeOS in Predictive Domain Adaptation (PDA), a particular DG setting that has been recently put under the spotlight by [adagraph]. Here a single labeled data is available at training time together with a set of unlabeled auxiliary domains which are provided together with extra metadata (image timestamp, camera pose, ) useful to derive the reciprocal relation among the auxilary sets and the labeled source. For PDA we follow [adagraph], testing on CompCars [Yang_2015_CVPR] and Portraits [Ginosar_2015_ICCV_Workshops]. The first one is a large-scale dataset composed of 136,726 vehicle photos taken in the space of 11 years (from 2004 to 2015). As in [adagraph], we selected a subset of 24,151 images organized in 4 classes (type of vehicle: MPV, SUV, sedan and hatchback) and 30 domains obtained from the combination of the year of production (range between 2009 and 2014) and the perspective of the vehicle (5 different view points). The second dataset is a large collection of pictures taken from American high school year-books. The photos cover a time range between 1905 and 2013 over 26 American states. Also in this case we follow [adagraph] for the experimental protocol: we define a gender classification task performed on 40 domains obtained choosing 8 decades (from 1934) and 5 regions (New England, Mid Atlantic, Mid West, Pacific and Southern).

Domain Generalization

To align our PACS experiments with the training procedure used in [jigen], we apply random cropping, random horizontal flipping, photometric distortions and resize crops to 222222 so that we get equally spaced square tiles on a 3

3 grid for the jigsaw puzzle task. We train the network for 40 epochs using SGD with momentum set at 0.9, an initial learning rate of 0.001, a cumulative batch size of 128 original images and 128 shuffled images and a weight decay of

. We divide train inputs in 90% train and 10% validation splits, and test on the target with the best performing model on the validation set. By indicating the auxiliary task loss weight with , we achieve the training convergence for the self-supervised task by assigning , and use that value for all our experiments, including DA and PDA settings, without further optimization. We also leave hyperparameters for the one-sample finetuting steps fixed to their initial training values.

The obtained results are shown in Table 1, together with several useful baselines. In particular, JiGen [jigen] was the first method showing that self-supervision tasks can support domain generalization, while D-SAM [Antonio_GCPR18] and EPI-FCR [hospedales19] propose networks with domain specific aggregation layers and domain specific models respectively, with the second one introducing also a particular episodic training procedure and getting the current DG state of the art on PACS. DANN [ganin2014unsupervised] exploits a domain adversarial loss to obtain a source invariant feature representation. MLDG [MLDG_AAA18] is a meta-learning based optimization method. We underline that all these baseline, with the notable exception of JiGen, need source data provided with both class and domain label. On this basis, the advantage that GeOS shows with respect to EPI-FCR is even more significant. Since also JiGen leverages over self-supervised knowledge, it might benefit of the One Sample Learning procedure at test time as in GeOS. For a fair comparison we used the code provided by the authors, implementing and running on it our Algorithm 1. The row JiGen + OS reports the obtained results, showing a small advantage over the original JiGen, confirming the beneficial effect of the fine tuning procedure. However the gain is still limited with respect to the top result of GeOS: the flat multi-task architecture of JiGen implies a re-adaptation of the whole network which might be out of reach with a single target sample. This confirms the effectiveness of the chosen auxiliary learning structure chosen for GeOS.

PACS-DG art_paint. cartoon sketches photo Avg.
Resnet-18
[Antonio_GCPR18] Deep All 77.87 75.89 69.27 95.19 79.55
D-SAM 77.33 72.43 77.83 95.30 80.72
[hospedales19] Deep All 77.60 73.90 70.30 94.40 79.10
DANN 81.30 73.80 74.30 94.00 80.08
MLDG 79.50 77.30 71.50 94.30 80.70
EPI-FCR 82.10 77.00 73.00 93.90 81.50
[jigen] Deep All 77.85 74.86 67.74 95.73 79.05
JiGen 79.4 75.25 71.35 96.03 80.51
JiGen + OS 79.40 75.24 72.26 96.27 80.79
GeOS 79.79 75.06 76 96.65 81.88
Table 1: Domain Generalization results on PACS. The results of GeOS are average over 3 repetitions of each run. Each column title indicates the name of the domain used as target.
PACS-DG art_paint. cartoon sketches photo Avg.
Resnet-18
null hypothesis 79.26 74.09 70.13 96.23 79.93
GeS 78.95 74.36 73.99 96.29 80.90
79.74 74.84 75.35 96.53 81.62
79.79 75.01 76 96.61 81.85
79.79 75.06 76 96.65 81.88
79.49 74.11 70.6 95.87 79.52
78.19 74.81 71.63 95.79 80.11
Table 2: Analysis of several variants of GeOS: not using the auxiliary knowledge, turning off the one sample finetuning at test time, increasing the number of self-supervised iterations on the target sample and also changing the self-supervised task from solving jigsaw puzzles to image rotation recognition.

Analysis and Discussion

We provide a further in-depth analysis of the proposed method, starting from the results in Table 2. First of all we trained the same network architecture of GeOS but without using the auxiliary self-superivised data: in this case we start from the same hyperparameter initialization setting used for GeOS but we turn on the gradient propagation over the auxiliary network block which now behaves as an extra residual layer for the main primary model. The row null hypothesis in the table indicates that the advantage of GeOS is not due to the increased depth and parameter count, but originates instead from the proper use of self-supervision and one sample fine tuning. To even decouple these last two components, we turn off the one sample learning procedure at test time: the obtained version GeS of our algorithm still outperform JiGen and many of the other competitive methods in Table 1, that yet use more data annotation.

When the fine tuning procedure on the test sample is on, it is possible to optimize the auxiliary network block with a different number of SGD iterations. We show that the obtained results increase with the number of iterations, but are already remarkable with a single one. Finally we evaluate the effectiveness of GeOS and its simplified version GeS when changing the type of self-supervised knowledge used as auxiliary information. Precisely we follow [gidaris2018unsupervised] and rotate the images at steps of 90°, training the auxiliary block for recognition among the four orientations. In this case GeS does not provide any advantage with respect to the null hypothesis baseline. This reveals that the choice of the self-supervised task influences the generalization capabilities of our approach, but the possibility to still run fine tuning on every single test sample maintains a beneficial effect.

PACS-DA art_paint. cartoon sketches photo Avg.
Resnet-18
[mancini2018boosting] Deep All 74.70 72.40 60.10 92.90 75.03
Dial 87.30 85.50 66.80 97.00 84.15
DDiscovery 87.70 86.90 69.60 97.00 85.30
[jigen] Deep All 77.85 74.86 67.74 95.73 79.05
JiGen 84.88 81.07 79.05 97.96 85.74
GeS 80.96 77.56 78.78 97.39 83.67
Table 3: Multi-source Domain Adaptation results on PACS obtained as average over 3 repetitions for each run.

Unsupervised Domain Adaptation

Although designed for DG, our learning approach can also be used in the DA setting. To test its performance, we run experiments on PACS as already done in [jigen]. We choose the same training hyperparameters used in DG experiments, with the difference that we train the self-supervised task using images from the target unlabeled domain only, and we validate the network on the self-supervised jigsaw puzzle task using an held-out split from the target. Since all the target data are now available at once, the one sample finetuning strategy is superfluous, thus we fall back to the simplified GeS version of our approach. Even just exploiting the self-supervised knowledge and not using any explicit domain adaptation strategy, results in Table 3 show that GeS reduces the domain gap with the target domain, yielding an accuracy increase of more than 4 percentage points over the Deep All baseline.

Both DDiscovery [mancini2018boosting] and Dial [carlucci2017auto] are methods that can be applied on the whole set of source samples without their domain label, as well as JiGen [jigen], thus the comparison with GeS here is fair in terms of data annotation involved. However, it is useful to remark that all those methods minimize an extra entropy loss on the target data. Although it might be beneficial for adaptation, this further learning condition is not applicable in the DG setting and introduces a further computational burden due to the need of tuning the relative loss weight to adjust its relevance with respect to the other losses already included in the training model. For a better understanding, we focus on JiGen and analyze its behaviour when changing the entropy loss weight . The obtained performance is presented in Figure 4 and clearly indicate that JiGen is fairly sensitive to , besides having overall more ad-hoc hyperparameters that GeS and GeOS.

Figure 4: Analysis of JiGen in the PACS DA setting. The parameter weights the entropy loss that involves the target data. Moreover the method exploits two different () auxiliary loss weights related to the self-supervised task, besides the parameter used to regulate the data loading procedure for original and shuffled images.
Resnet-18
Method CompCars Portraits-Dec. Portraits-Reg.
Baseline 56.8 82.3 89.2
AdaGraph 58.8 87.0 91.0
GeS 60.2 87.1 91.6
GeOS 60.0 87.1 91.5
Table 4: Predictive DA results.

Predictive DA without metadata

The minimal need of supervision of GeOS puts it in a particularly profitable condition with respect to other existing DG methods in the challenging Predictive DA experimental setting. Indeed GeOS can ignore the availability of metadata and exploit directly the large scale unlabeled auxiliary sources. We compare the performance of our method against AdaGraph [adagraph]

, a very recent approach that exploits domain-specific batch-normalization layers to learn models for each source domain in a graph, where the graph is provided on the basis of the source auxiliary metadata.

We follow the experimental protocol described in [adagraph]. For CompCars, we select a pair of domains as source and target and use the remaining 28 as auxiliary unlabeled data. Considering all possible domain pairs, we get 870 experiments and observe the average accuracy results over all of them. A similar setting is applied for Portraits, for which we consider the across decades scenario (source and target domains selected from the same decade) and the across region scenario (source and target from the same region). In total we run 440 experiments for Portraits.

More in details, for CompCars, we start from an ImageNet pretrained model and trained for 6 epochs on source domain using Adam as optimizer with weight decay of . The batch size used is 16 and the learning rate is for the classifier and for the rest of the network; the learning rate is decayed by a factor of 10 after 4 epochs. In the case of Portraits the main learning procedure remains the same used above, except for the number of epochs that in this case is 1 and for the jigsaw weight that in this case was set to for the experiments across decades and to for the experiments across regions.

Table 4 show the obtained results, indicating that GeS outperforms AdaGraph in all settings, despite using much less annotated information. In this particular setting, turning on the fine tuning process on a each target sample is irrelevant: the amount of auxiliary source data is so abundant that the self-supervised auxiliary task is already providing its best generalization effect, thus GeOS does not show any further advantage with respect to GeS.

5 Conclusions

This paper presented the first algorithm for domain generalization able to learn from target data at test time, as images are presented for classification. We do so by learning regularities about target data as auxiliary task through self-supervision. The algorithm is very general and can be used with success in several settings, from classic domain adaptation to domain generalization, up to scenarios considering the possibility to access side domains [adagraph]. Moreover, the principled AL framework leads to a notable stability of the method with respect to the choice of its hyperparameters, a highly desirable feature from deployment in realistic settings. Future work will further investigate this new generalization scenario, studying the behaviour of the approach with respect to the amount and the quality of unsupervised data available at training time.

References

Supplementary Material

We provide here an extended discussion and further evidences of the advantage introduced by learning to generalize one sample at a time through the proposed auxiliary self-supervised finetuning process. First of all we clarify the difference between our full method named GeOS and its simplified version GeS.

GeS is the architecture we designed for deep learning Generalization by exploiting Self-Supervision. Its structure is depicted in Figure 3 of the paper. Besides the main network backbone that tackles the primary classification task, we introduce an auxiliary block that deals with the self-supervised objective. It provides useful complementary features that are finally recombined with those of the main network improving the robustness of the primary model.

We mostly focused on the jigsaw puzzle self-supervised task, thus our auxiliary data are scrambled version of the original images, recomposed with their own patches in disordered positions. This specific formalization for jigsaw puzzle was recently introduced in [jigen], where the method JiGen learns jointly over the ordered and the shuffled images with a flat multi-task architecture. Although it showed to be effective, this approach substantially disregards the warnings highlighted in [NorooziF16] about the need of avoiding shortcuts that exploit low-level image statistics rather than high-level semantic knowledge while solving a self-supervised task. By fitting the self-supervised knowledge extraction in an auxiliary learning framework, GeS keeps the beneficial effect provided by self-supervision without the risk of confusing the learning process with low-level jigsaw puzzle specific information. Indeed the auxiliary knowledge is extracted by a dedicated residual block towards the end of the network together with a tailored backpropagation strategy that keeps the primary and the auxiliary tasks synchronized but separated in their specific objectives.

In the DA setting, the auxiliary block of GeS is trained exclusively on the target images, which are available at training time but unlabeled. In this case, when the ordered source images enter the auxiliary block we obtain target-style features that help bridge the gap across domains. Indeed, these features are recombined with those from the main backbone and together guide the learning of the primary classification model. In DG, the ordered source images are provided as input to the primary classification task, while their scrambled versions are fed to the auxiliary block. Thus the scrambled source images train the auxiliary block, which is finally used as a complementary feature extractor for the respective ordered images.
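Putting the pieces together, one DG training step could look like the following sketch, which reuses the GeS and scramble sketches above; the function name, argument names and loss weighting are illustrative.

```python
import torch
import torch.nn.functional as F

def dg_train_step(model, optimizer, ordered, labels, permutations, jig_weight):
    # One DG training step: ordered source images feed the primary task,
    # their scrambled versions feed the auxiliary block.
    scrambled, jig_labels = zip(
        *[scramble(img, permutations=permutations) for img in ordered]
    )
    scrambled = torch.stack(scrambled)
    jig_labels = torch.tensor(jig_labels)

    logits, jig_logits = model(ordered, scrambled)
    loss = F.cross_entropy(logits, labels) \
        + jig_weight * F.cross_entropy(jig_logits, jig_labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```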

At test time, for DA each ordered target image passes through both the main and the auxiliary block for feature extraction, using the network model obtained at the end of the training phase. In DG, for each target sample we may follow the same procedure used for DA testing. However, we can also do more by leveraging self-supervision to distill further knowledge.

target        run   GeS     GeOS    GeOS    GeOS
photo          1    96.41   +0.12   +0.18   +0.30
               2    96.41   +0.30   +0.48   +0.36
               3    96.05   +0.30   +0.30   +0.42
art_paint.     1    79.00   +0.74   +1.13   +1.37
               2    78.71   +0.83   +0.33   +0.09
               3    79.15   +0.78   +0.64   +0.64
cartoon        1    73.72   +0.85   +0.67   +0.79
               2    74.23   +0.17   +0.34   +0.39
               3    75.13   +0.42   +0.55   +0.51
sketches       1    73.45   +1.53   +2.04   +2.04
               2    74.14   +1.35   +2.16   +1.81
               3    74.37   +1.20   +1.83   +2.19
Table 5: PACS-DG accuracy gains of GeOS over GeS when finetuning separately on each target sample at test time; the three GeOS columns correspond to an increasing number of iterations (left to right).

GeOS is our full method: it exploits the architecture of GeS and runs a fine-tuning procedure at test time on each target sample in the DG setting. The target image is scrambled and provided as input to the auxiliary block, which is initialised with the model obtained from the scrambled source images at training time. Although we start from a single instance, standard data augmentation together with the scrambling procedure provides enough samples to fine-tune the auxiliary block. Of course, minimizing the jigsaw loss means running multiple SGD iterations to update the network parameters.
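A hedged sketch of this test-time procedure follows, reusing the GeS and scramble sketches above. The number of iterations, learning rate, batch size and the augmentation stand-in are all assumptions for illustration, not the paper's exact settings.

```python
import copy
import torch
import torch.nn.functional as F
from torchvision import transforms

augment = transforms.RandomHorizontalFlip()  # stand-in for the full augmentation pipeline

def geos_predict(model, image, permutations, steps=3, lr=1e-4, batch=16):
    # Test-time adaptation on a single target sample: fine-tune only the
    # auxiliary block on the jigsaw loss for a few SGD iterations, then
    # classify. steps, lr and batch are illustrative values.
    local = copy.deepcopy(model)  # keep the trained model intact across samples
    opt = torch.optim.SGD(local.aux_block.parameters(), lr=lr)
    for _ in range(steps):
        # Augmentation plus scrambling turn the single image into a mini-batch.
        views = torch.stack([augment(image) for _ in range(batch)])
        scrambled, jig_labels = zip(
            *[scramble(v, permutations=permutations) for v in views]
        )
        _, jig_logits = local(views, torch.stack(scrambled))
        loss = F.cross_entropy(jig_logits, torch.tensor(jig_labels))
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        logits, _ = local(image.unsqueeze(0))
    return logits.argmax(dim=1).item()
```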

Table 5 extends Table 2 of the paper, showing how subsequent iterations of the auxiliary block optimization always introduce an improvement with respect to GeS. We executed three runs for each experiment, and the results indicate that the advantage is present in every single experiment, not just on average. We underline that, although JiGen [jigen] also exploits self-supervised knowledge for domain generalization, it does not consider adapting the network at test time on each target sample. Indeed, its flat multi-task structure would imply an overall update of the network, while with GeOS we can focus on adapting exclusively the auxiliary knowledge block, with a larger benefit on the obtained DG accuracy, as shown in Table 1 of the main submission.
