
INDIGO: Intrinsic Multimodality for Domain Generalization

For models to generalize to unseen domains (a.k.a. domain generalization), it is crucial to learn feature representations that are domain-agnostic and capture the underlying semantics that make up an object category. Recent advances in weakly supervised vision-language models, which learn holistic representations from cheap, noisy text annotations, have demonstrated their ability at semantic understanding by capturing object characteristics that generalize across different domains. However, when multiple source domains are involved, the cost of curating textual annotations for every image in the dataset can grow several-fold with the number of domains. This makes the process tedious and infeasible, hindering us from directly using these supervised vision-language approaches to achieve the best generalization on an unseen domain. Motivated by this, we study how multimodal information from existing pre-trained multimodal networks can be leveraged in an "intrinsic" way to make systems generalize to unseen domains. To this end, we propose IntriNsic multimodality for DomaIn GeneralizatiOn (INDIGO), a simple and elegant way of leveraging the intrinsic modality present in these pre-trained multimodal networks along with the visual modality to enhance generalization to unseen domains at test time. We experiment on several domain generalization settings (ClosedDG, OpenDG, and limited sources) and show state-of-the-art generalization performance on unseen domains. Further, we provide a thorough analysis to develop a holistic understanding of INDIGO.


1 Introduction

The underlying assumption that training and test data comprise identically distributed samples often inhibits the applicability of deep learning models in practical scenarios where such a condition may not hold, including applications such as medical imaging, autonomous driving, and robotic manipulation [Dou2019DomainGV, Li2021FewShotDA, DeGrave2020AIFR, Hoyer2021DAFormerIN, Wang2021DomainGF, Yue2019DomainRA]. Recently, the computer vision community has seen concerted efforts towards defining problem settings [shu2021open, Li2021ProgressiveDE, ZeroShotDG, Mancini2020TowardsRU, Chandhok2021StructuredLE, Mangla2022COCOACA] as well as developing deep neural network models [cha2021swad, rame2021ishr, EoA, Dou2019DomainGV, DGstyle, DG2021] to build systems that can learn from existing data and generalize to an unseen domain. Domain Generalization (DG) refers to the task of learning a model using data from source domains (e.g., clipart, painting, real world) in order to generalize and predict effectively on an unseen domain (e.g., sketch). Most previous approaches [basicdg1, basicdg2, basicdg3, basicdg4, MTAE, DAFL, condinvariant] that tackle the DG problem use different learning paradigms and training strategies to learn domain-agnostic semantic features that represent an object category and can thus extend to unseen-domain samples at test time. Other methods [BalanceSpecInv, BNE, BNE2] have also shown that leveraging domain-specific features along with domain-invariant information can further improve a model's generalization on unseen domains. More recently, vision transformers (ViTs) [vit, deit] have demonstrated a better ability to recognize object shapes in less textured data such as paintings [naseer2021intriguing, DGtransformer], which is a desirable trait for making models generalize to unseen domains.

An alternative strategy to address this task is to look for other sources of information that can help disentangle domain-specific and domain-agnostic characteristics and thereby equip models with the ability to capture general, domain-agnostic class-level cues. Recent progress towards weakly supervised vision-language models [Radford2021LearningTV, ALBEF, li2021supervision] has demonstrated their ability at semantic understanding and triggered interest in putting them to practical use in various settings. These models are learned from weak supervision obtained from noisy web-based automatic label annotations and hashtags. Nevertheless, these approaches provide a methodology for learning holistic representations from cheap, weakly supervised, noisy text annotations that capture class-level semantics of object categories, such as shape and content [Radford2021LearningTV]. Such representations can inherently capture object characteristics that generalize to unseen domains. We leverage this potential of vision-language models in this work.

Figure 1: Illustration of our broader idea. In scenarios where we do not have access to explicit modalities like image captions for the source domain data, we leverage the "intrinsic" modality present in pre-trained multimodal networks along with the visual modality obtained from the image.

Curating semantically dense textual annotations for every image in the dataset can be a daunting task, since it requires time and labor-intensive crowdsourcing pipelines. Further, when multiple source domains are involved, the annotation cost grows several-fold with their number, making the process tedious and infeasible. This creates a bottleneck and hinders us from directly using supervised vision-language approaches [Radford2021LearningTV, li2021supervision, ALBEF] to achieve the best generalization on unseen domains. Motivated by this, we study how the multimodal information in pre-trained multimodal networks [Radford2021LearningTV, li2021supervision, ALBEF] can be leveraged intrinsically to make systems robust to domain shift and enhance generalization on unseen domains. We propose IntriNsic multimodality for DomaIn GeneralizatiOn (INDIGO), a simple and elegant way of leveraging the intrinsic modality present in these pre-trained multimodal networks along with the visual modality to enhance generalization to unseen domains at test time. Figure 1 provides a broader understanding of the proposed idea. To the best of our knowledge, this is the first effort to study how multimodality can be leveraged intrinsically via pre-trained multimodal models to generalize better to unseen domains. Our key contributions are as follows:

  • We propose IntriNsic multimodality for DomaIn GeneralizatiOn (INDIGO), a simple and elegant way of leveraging the intrinsic modality present in pre-trained multimodal networks along with the visual modality in order to generalize better to unseen domains. Besides that, we explore other ways of leveraging the intrinsic modality and introduce three new baseline approaches to achieve the same. We use state-of-the-art vision architectures, vision transformers (ViTs) [vit], to handle the visual modality.

  • We perform comprehensive experiments on standard DG benchmarks, DomainNet and Office-Home, and show that INDIGO achieves a new state-of-the-art by outperforming prior state-of-the-art methods as well as conventional and newly introduced baselines. Even in more challenging settings like OpenDG and limited-sources DG, INDIGO consistently outperforms the aforementioned baselines.

  • We perform a thorough analysis to characterize the efficacy of INDIGO in leveraging the intrinsic and visual modalities obtained from the pre-trained multimodal network and the vision transformer (ViT), respectively.

2 Related Work

Domain Generalization. The reliance of deep learning models on tailored, task-specific data restricts their applicability, which makes it crucial to equip these models with the ability to tackle domain shift at test time [DGSurvey1, DGSurvey2, datasetbias, bias2]. Domain Generalization (DG) [basicdg1, basicdg2, basicdg3, basicdg4, MTAE, DAFL] aims to develop models that can learn from source domains (where data is abundant) and generalize to unseen novel domains, given that they share the same label set. Most previous approaches that tackle domain shift learn a domain-invariant representation through data manipulation [crossgrad, advaug, DGstyle, DGstyle2], learning strategies [basicdg1, basicdg2, basicdg3, basicdg4, MTAE, DAFL, condinvariant], or optimization policies [metadg1, metadg2, metadg3]. Other approaches leverage domain-specific characteristics [BNE, BNE2] or a balance of domain-invariant and domain-specific features to further enhance generalization on unseen domains [BalanceSpecInv, Mangla2022COCOACA]. Recently, [shu2021open] extended the conventional DG setting to an even more practical setup that allows the class label sets of the source domains to be disjoint. This enables practical, real-world applications by tackling cases where visual samples for all categories of interest may not be available together for all source domains due to long-tailed distributions or the gradual addition of rare novel object categories.
Vision Transformers. The recent advent of attention-based transformer architectures for computer vision tasks [dosovitskiy2020vit, steiner2021augreg, chen2021outperform, Touvron2021TrainingDI] has motivated several efforts to study their application to object recognition [dosovitskiy2020vit, Touvron2021TrainingDI], detection [Carion2020EndtoEndOD, Zhu2021DeformableDD], and segmentation [xie2021segformer, Hoyer2021DAFormerIN]. The success of these models in practical applications can be attributed to the self-attention mechanism, which allows them to attend to a sequence of image patches and learn global interactions more effectively than their convolutional counterparts [Khan2022TransformersIV, naseer2021intriguing]. Further, these models require minimal inductive bias by design, which enables them to model complex functions and capture relationships from large-scale datasets [Khan2022TransformersIV]. These salient features allow vision transformers to perform exceptionally well on computer vision tasks and to better tackle nuances like occlusions and adversarial perturbations [naseer2021intriguing, Naseer2021OnIA]. The capacity of transformer architectures to learn from large-scale pre-training and their ability to capture long-range, content-dependent interactions has led to progress in utilizing these models to process multiple modalities for vision tasks like object detection [Gupta2021TowardsGP, Kamath2021MDETRM, Maaz2021Multimodal] and classification [Radford2021LearningTV, ALBEF, li2021supervision].
Multimodal Learning. Multimodal learning aims to build models that can combine and process information from multiple modalities, like image and text. Most vision-language models use cross-modal transformers to fuse and align the information between text and image, such as LXMERT [tan2019lxmert], UNITER [chen2020uniter], ViLBERT [lu2019vilbert], VinVL [zhang2021vinvl], and OSCAR [li2020oscar]. Other works like ICMLM [sariyildiz2020icmlm] and VirTex [Desai2021VirTexLV] have shown that language supervision on COCO Captions can also produce useful visual representations. However, contrastive vision-language pre-training (CLIP) [Radford2021LearningTV] recently gained much attention because of its simplicity, scale, and strong results. It proposes a simple pre-training task of predicting which caption goes with which image through image-text contrastive supervision and demonstrates that the resulting image representations transfer to several downstream tasks like classification [Radford2021LearningTV], image retrieval [Luo2021CLIP4Clip], object detection [zhou2021detecting], image synthesis [styleCLIP], video understanding [videoclip], and 3D recognition [PointCLIP]. These results have focused the attention of the vision-language community on developing models using such contrastive objectives. DeCLIP [li2021supervision] employs additional self-, multi-view, and nearest-neighbor supervision along with image-text contrastive supervision to match the performance of CLIP with 7.1× less data. ALBEF [ALBEF] uses momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model. SLIP [mu2021slip] introduces a multi-task learning framework combining self-supervised learning and CLIP pre-training. ALIGN [align] uses a larger but noisier, uncurated dataset and shows similar results.

In this work, we leverage the intrinsic modality present in such contrastive vision-language multimodal networks, as we believe their contrastive learning objective encourages representations of semantically similar classes to cluster together and those of dissimilar classes to move apart, allowing these models to implicitly focus on discriminative, class-specific semantic cues of a given object category.

3 Intrinsic Multimodality for DG

3.1 Background

Domain Generalization (DG). The goal of DG is to learn a model using data from source domains such that it generalizes to an unseen target domain. Let $\mathcal{D}_{tr} = \{(x_i, y_i, d_i)\}_{i=1}^{N}$ denote the training set, where $x_i$ is an image in the visual space $\mathcal{X}$ with corresponding class label $y_i$ from a set of known class labels $\mathcal{Y}$ and domain label $d_i$ from a set of source domains $\mathcal{S}$. The test set is denoted by $\mathcal{D}_{te} = \{(x_j, y_j, t)\}_{j=1}^{M}$, where $t \notin \mathcal{S}$ represents the unseen target domain. We aim to learn a model $f: \mathcal{X} \rightarrow \mathcal{Y}$ that captures the mapping from $\mathcal{X}$ to $\mathcal{Y}$ such that it is trained using $\mathcal{D}_{tr}$ but can predict the class label for an $x$ sampled from the unseen domain $t$ in $\mathcal{D}_{te}$.

Vision Transformers (ViTs). A ViT [vit] is composed of a sequence of blocks, where each block contains multi-headed self-attention (MSA) with a feedforward network (FFN) and layer normalization (LN). An input image $x \in \mathbb{R}^{H \times W \times C}$ is first converted into a sequence of patch tokens $\{x_p^i\}_{i=1}^{N}$ by dividing it with a specific patch size followed by a linear projection. Next, an additional classification (CLS) token $x_{cls}$ is added to the sequence, followed by adding a positional embedding to each token to provide positional information. All tokens are then passed through stacked transformer blocks. The CLS token interacts with all patch tokens and summarizes them into a single embedding vector for final classification. The processing of the $\ell$-th transformer block, acting on the token sequence $z_{\ell-1}$, can be summarized as:

$$z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad z_{\ell} = \mathrm{FFN}(\mathrm{LN}(z'_{\ell})) + z'_{\ell} \tag{1}$$
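The pre-norm block in Eq. (1) maps directly to a few lines of code. Below is a minimal PyTorch sketch, with illustrative dimensions (384-dimensional embeddings and 6 heads, as in DeiT-S); it is a sketch for clarity, not the authors' implementation.

```python
# Minimal sketch of one ViT block from Eq. (1): pre-norm multi-headed
# self-attention and a feed-forward network, each with a residual connection.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=384, num_heads=6, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):                    # z: (batch, 1 + num_patches, dim)
        h = self.norm1(z)
        z = z + self.attn(h, h, h)[0]        # MSA + residual
        z = z + self.ffn(self.norm2(z))      # FFN + residual
        return z

# The CLS token (index 0) attends to all patch tokens across stacked
# blocks; its final embedding is used for classification.
tokens = torch.randn(2, 197, 384)            # e.g. CLS + 196 patch tokens
out = TransformerBlock()(tokens)
```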

3.2 Motivation: Multimodal networks generalize better to different domains

Most methods that tackle the domain-shift problem devise learning strategies that capture domain-agnostic features. Such methods rest on the assumption that there exists a domain-invariant manifold on which object images lie, irrespective of the domain in which they are rendered. For example, class-level semantic cues such as a long neck, long legs, and spots are stable characteristic features that define the class giraffe. Hence, it is crucial to design models that focus on the underlying semantic features that make up an object category and are robust to domain variations.

Text can describe images with syntactically and semantically meaningful sentences, offering a better way to summarize their content than one-hot or soft-label vectors. Vision-language models like CLIP [Radford2021LearningTV], which are trained on noisy, weakly aligned image-text pairs with minimal supervision, are better at understanding image content across different domains than other vision models (Resnet-50 and ViT-S [naseer2021intriguing]) trained on ImageNet-1K/-21K [deng2009imagenet] and Stylized-ImageNet [stylized-imagenet] (Figure 2). Since these models are trained with a contrastive learning objective that implicitly encodes information about inter-class relationships, they develop the ability to focus on class-specific semantic cues (rather than texture) that help them generalize under domain shifts.

Figure 2: Generalization to different domains. Performance of various pre-trained models on different domains of the Office-Home dataset. Results are averaged over 5 runs. The vision-language model Resnet-50 (CLIP) can be seen to outperform the others at generalizing to different domains. IN: ImageNet-1K, IN-21K: ImageNet-21K, SIN: Stylized ImageNet.

However, annotating every image in the source domains with captions can be a daunting task because of the time and labor involved. Motivated by this, we propose IntriNsic multimodality for DomaIn GeneralizatiOn (INDIGO), which exploits large-scale pre-trained vision-language models [Desai2021VirTexLV, li2021supervision, ALBEF] by integrating the intrinsic modality present in their representations with the visual modality obtained from a vision transformer trained on the source domains.

3.3 INDIGO: Leveraging intrinsic modality present in MViTs

As depicted in Figure 3a, there are three main components in our approach: (1) a multimodal branch, which consists of a multimodal vision transformer (MViT) pre-trained on image-text pairs and used to extract the intrinsic modality present in it; (2) a visual branch, which trains a vision transformer (ViT) to extract the visual modality that encodes meaningful shape-biased concepts from the source domains, useful for generalization; and (3) a fusion module, which combines the best of both the intrinsic and visual modalities through a multi-headed self-attention mechanism for final classification.

Figure 3: (a) (Proposed approach) INDIGO consists of a multimodal branch comprising a pre-trained MViT to obtain the intrinsic modality, a visual branch to extract the visual modality, and a fusion module to combine both; (New Baselines) (b) Distillation considers the MViT as a teacher and distills a ViT with a soft distillation loss via a DIST token; (c) Early Fusion fuses the intrinsic modality via a DIST token at the input layer of the visual branch itself; (d) Cross-Attention uses a CrossViT [chen2021crossvit] to cross-attend MViT features with ViT features.

Multimodal branch. We leverage pre-trained large-scale vision-language networks like CLIP [Radford2021LearningTV], DeCLIP [li2021supervision], and ALBEF [ALBEF] that use a contrastive objective to push the embeddings of matched image-text pairs together and those of non-matched pairs apart. The pipeline generally consists of an image encoder $E_I$ (in our case a ViT, which we call the MViT), a text encoder $E_T$, and linear projection layers $g_I$ and $g_T$. The image and text features (obtained from their respective encoders) are projected to the same dimension, normalized, and then aligned using the following contrastive loss:

$$\mathcal{L}_{con} = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{B}\exp(\mathrm{sim}(I_i, T_j)/\tau)} + \log\frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{B}\exp(\mathrm{sim}(I_j, T_i)/\tau)}\right]$$

Here, $(I_i, T_i)$ denotes the projected image and text embeddings of the $i$-th image-text pair in a batch of size $B$, where $I_i = g_I(E_I(x_i))$ represents the MViT's representation corresponding to the CLS token. The similarity function $\mathrm{sim}(\cdot,\cdot)$ is measured by the dot product, and $\tau$ is a learnable temperature variable to scale the logits.
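As a concrete illustration, the symmetric objective above amounts to two cross-entropy losses over the batch similarity matrix. The sketch below assumes the image and text embeddings are already projected and L2-normalized; the temperature value and function name are illustrative.

```python
# Sketch of the symmetric image-text contrastive loss over a batch,
# assuming L2-normalized projected embeddings of shape (B, d).
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) scaled similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)            # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)        # match each text to its image
    return 0.5 * (loss_i2t + loss_t2i)
```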

In scenarios where we do not have direct access to text annotations, we can assume that an image's unnormalized projected embedding would be weakly aligned with its hypothetical text description. This allows us to leverage the intrinsic modality present in a pre-trained multimodal vision transformer. Hence, we propose to use this unnormalized projected embedding as an "intrinsic" modality in our overall pipeline.
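For instance, using the open-source CLIP package as the pre-trained MViT, the intrinsic modality is simply the projected (unnormalized) image embedding of the frozen model. The sketch below shows one way to obtain it; the image path is a placeholder and the exact loading call depends on the library used.

```python
# Sketch: extracting the "intrinsic" modality from a frozen pre-trained
# MViT (here the open-source CLIP package, one possible choice).
import torch
import clip                      # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
mvit, preprocess = clip.load("ViT-B/16", device=device)
mvit.eval()                      # the MViT stays frozen

image = preprocess(Image.open("sample.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    # Projected but unnormalized image embedding, used as the intrinsic
    # modality token fed to the fusion module.
    intrinsic = mvit.encode_image(image)     # shape (1, 512) for ViT-B/16
```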

Visual branch. The visual branch is a sibling to the multimodal branch. We employ a trainable vision transformer to learn visual concepts from the source domains that might be absent in the MViT representations but are relevant to the task. These concepts can be dataset-, domain-, or even class-specific and, when combined with the "intrinsic" modality, can help boost the overall performance on the given task. Moreover, since ViTs are by design better than CNNs at recognizing object shapes [naseer2021intriguing, DGtransformer], we believe their shape-biased representations further assist our overall pipeline in generalizing to unseen domains (as we show through our experiments).
Fusion module. The purpose of the fusion module is to fuse the "intrinsic" modality (obtained from the multimodal branch) and the visual modality (obtained from the visual branch) to perform the final classification. We first project both of them to the same space via linear projections to obtain the intrinsic-modality token $z_m^0$ and the visual-modality token $z_v^0$. This is followed by a series of multi-headed self-attention (MSA) blocks and feed-forward networks (FFN) that perform inter-modality attention on both tokens as follows:

$$[z'^{\ell}_m, z'^{\ell}_v] = \mathrm{MSA}(\mathrm{LN}([z^{\ell-1}_m, z^{\ell-1}_v])) + [z^{\ell-1}_m, z^{\ell-1}_v], \qquad [z^{\ell}_m, z^{\ell}_v] = \mathrm{FFN}(\mathrm{LN}([z'^{\ell}_m, z'^{\ell}_v])) + [z'^{\ell}_m, z'^{\ell}_v] \tag{2}$$

The attention mechanism allows the intrinsic-modality token to attend to the visual-modality token and incorporate any dataset-, domain-, or class-specific concepts present in it. Similarly, the visual-modality token interacts with the intrinsic-modality token to learn the multimodal concepts present in it. This ensures that the final representations leverage the best of both modalities. Finally, the transformed representation of the intrinsic modality, $\tilde{z}_m$, is passed through a linear layer $h_m$ to obtain class predictions and minimize the cross-entropy loss. In addition, we add a regularizer that also minimizes the classification loss on the transformed representation of the visual-modality token, $\tilde{z}_v$ (by passing it through another linear layer $h_v$). The overall loss can be written as follows:

$$\mathcal{L} = \mathcal{L}_{CE}(h_m(\tilde{z}_m), y) + \lambda\,\mathcal{L}_{CE}(h_v(\tilde{z}_v), y) \tag{3}$$

where $\lambda$ is the regularization hyperparameter. In scenarios like OpenDG, where each source domain holds a disparate label set, the chances of the learned representations becoming domain biased are high. Rather than minimizing Equation 3, we minimize the following semantic alignment loss:

$$\mathcal{L}_{sem} = -\log\frac{\exp(\mathrm{sim}(p_m(\tilde{z}_m), E_T(t_y))/\tau)}{\sum_{c\in\mathcal{Y}}\exp(\mathrm{sim}(p_m(\tilde{z}_m), E_T(t_c))/\tau)} \;-\; \lambda\log\frac{\exp(\mathrm{sim}(p_v(\tilde{z}_v), E_T(t_y))/\tau)}{\sum_{c\in\mathcal{Y}}\exp(\mathrm{sim}(p_v(\tilde{z}_v), E_T(t_c))/\tau)} \tag{4}$$

where $p_m$ and $p_v$ are semantic projection layers, $\tau$ is a learnable temperature variable to scale the logits, $t_c$ is the text prompt for class $c$, i.e., "a photo of {class}", and $E_T$ is the text encoder of the pre-trained multimodal network. Enforcing that images align with their corresponding class prompts ensures that the representations do not get biased towards domains and capture domain-agnostic, class-specific semantics described via the class prompts.
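To make the fusion module and the closed-set objective of Eqs. (2)-(3) concrete, here is a minimal PyTorch sketch. The class names, dimensions, number of layers, and the value of lambda are illustrative assumptions rather than the authors' released implementation.

```python
# Sketch of the fusion module (Eq. 2) and overall loss (Eq. 3): project the
# intrinsic and visual embeddings to a common space, refine the two-token
# sequence with self-attention blocks, and classify each token with its
# own linear head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    def __init__(self, dim_intrinsic=512, dim_visual=384, dim=384,
                 num_classes=65, num_layers=3, num_heads=6):
        super().__init__()
        self.proj_m = nn.Linear(dim_intrinsic, dim)       # intrinsic-modality token
        self.proj_v = nn.Linear(dim_visual, dim)          # visual-modality token
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head_m = nn.Linear(dim, num_classes)         # main classifier
        self.head_v = nn.Linear(dim, num_classes)         # regularizing classifier

    def forward(self, intrinsic, visual):
        tokens = torch.stack([self.proj_m(intrinsic),
                              self.proj_v(visual)], dim=1)   # (B, 2, dim)
        tokens = self.blocks(tokens)                         # inter-modality attention
        z_m, z_v = tokens[:, 0], tokens[:, 1]
        return self.head_m(z_m), self.head_v(z_v)

def indigo_loss(logits_m, logits_v, labels, lam=0.5):        # lam value is illustrative
    # Eq. (3): cross-entropy on the intrinsic token plus a weighted
    # cross-entropy regularizer on the visual token.
    return F.cross_entropy(logits_m, labels) + lam * F.cross_entropy(logits_v, labels)
```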

3.4 Baselines: Other approaches for leveraging intrinsic modality

Besides proposing INDIGO, we also explore other ways to leverage the intrinsic modality obtained from the multimodal branch and combine it with the visual modality extracted from the visual branch. We introduce three baseline approaches (variations to our proposed approach) to achieve the same.

Logit Distillation. As illustrated in Figure 3b, we consider the pre-trained MViT as a teacher network and use a soft-distillation strategy to distill the intrinsic modality present in it via an additional distillation (DIST) token, similar to DeiT [deit].

Early Fusion. Instead of using the DIST token for logit distillation, we can use it to fuse the intrinsic modality at the input layer of the visual branch itself. This is illustrated in Figure 3c. The CLS token can now interact with both the intrinsic modality (provided via the DIST token) and the patch tokens to summarize the information present in them for final classification; a sketch of this token assembly is given after the next baseline.

Cross-Attention. We can cross-attend the features (corresponding to the image patches and the CLS token) extracted from the vision-language model with the image patch embeddings and CLS token of the visual branch. For this purpose, we employ a Cross-attention Vision Transformer (CrossViT) [chen2021crossvit] rather than a vanilla ViT [vit] in the visual branch. This is illustrated in Figure 3d.
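As referenced under the Early Fusion baseline, the sketch below shows how the projected intrinsic embedding can be inserted as an extra DIST-like token alongside the CLS and patch tokens at the input of the visual branch. Dimensions and names are illustrative; the actual DeiT-S integration (pre-trained weights, positional-embedding handling, etc.) is omitted.

```python
# Sketch of the Early Fusion baseline input: the projected intrinsic
# embedding becomes an extra DIST-like token next to the CLS and patch
# tokens of the visual branch.
import torch
import torch.nn as nn

class EarlyFusionInput(nn.Module):
    def __init__(self, dim_intrinsic=512, dim=384, img_size=224,
                 patch_size=16, in_chans=3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(in_chans, dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.intrinsic_proj = nn.Linear(dim_intrinsic, dim)   # MViT embedding -> DIST token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))

    def forward(self, images, intrinsic):
        b = images.size(0)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.intrinsic_proj(intrinsic).unsqueeze(1)    # intrinsic modality as DIST token
        tokens = torch.cat([cls, dist, patches], dim=1) + self.pos_embed
        return tokens   # fed to the ViT blocks; CLS summarizes both modalities
```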

4 Experiments and Analysis

Closed Domain Generalization. We first perform experiments on the following domain generalization datasets under the closed setting: (1) DomainNet [domainnet], a large-scale dataset containing 586,575 examples from 345 classes and six domains (clipart, infograph, painting, quickdraw, real, sketch); and (2) Office-Home [office-home], containing 15,588 examples from 65 classes and four domains (art, clipart, product, real world).
(Baselines) We evaluate and compare four kinds of training pipelines: (1) CNNs, which include state-of-the-art methods [cha2021swad, rame2021ishr, EoA] that use a Resnet-50 backbone; (2) ViTs, which include a DeiT-S [deit] backbone (considered equivalent to Resnet-50) trained in the AGG manner; (3) MViTs, which include conventional ways of using the pre-trained MViT (such as zero-shot inference and transfer learning using a linear layer or attention layers); and (4) MViTs + ViTs, which include our newly introduced fusion baselines (distillation, early fusion, and cross-attention) and our proposed fusion, INDIGO (all using a DeiT-S visual backbone). Implementation and architectural details of the fusion module are described in the supplementary material.


(Training and evaluation protocol) Following previous works [cha2021swad, rame2021ishr, domainbed], we consider each domain in turn as the target domain and the remaining domains as source domains for training. We use the test-domain validation (reporting the best performance on the test set) and training-domain validation (using a held-out validation split of the source domains) model selection criteria for DomainNet and Office-Home, respectively, as described in [domainbed].
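The leave-one-domain-out protocol described above can be summarized by the short sketch below; the domain names are those of Office-Home, while train_fn and eval_fn are placeholders for the actual training and evaluation pipeline.

```python
# Sketch of the leave-one-domain-out evaluation protocol.
OFFICE_HOME_DOMAINS = ["art", "clipart", "product", "real_world"]

def leave_one_domain_out(domains, train_fn, eval_fn):
    results = {}
    for target in domains:
        sources = [d for d in domains if d != target]
        # Training-domain validation: model selection uses held-out splits
        # of the source domains only; the target domain is never touched.
        model = train_fn(sources)
        results[target] = eval_fn(model, target)
    return results
```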
(Results) Table 1 presents our results when CLIP-ViT-B/16 [Radford2021LearningTV] is used as the MViT in the multimodal branch. As can be seen, INDIGO achieves new state-of-the-art results, outperforming all compared methods by good margins. In particular, on challenging domains like quickdraw, where conventional ways of using MViTs perform worse than prior art, INDIGO achieves the best performance by leveraging the best of both the intrinsic and the visual modality. Further, we observe that ViTs trained with the simple vanilla AGG loss easily beat state-of-the-art CNN-based approaches, SWAD [cha2021swad] and EoA [EoA]. This shows that their design offers shape-biased representations (compared to CNNs), which INDIGO leverages. Among our newly proposed baselines, Early Fusion stands out as the strongest competitor.

Type Method DomainNet Office-Home
C S P Q I Avg. R C P A Avg.
CNNs AGG 58.4 49.9 47.3 13.4 19.8 37.76 77.3 53.4 76.5 62.7 67.47
IRM [arjovsky2019invariant] 51.0 44.7 38.8 11.8 16.7 32.6 77.2 52.3 75.2 61.8 66.63
DRO [DRO] 47.8 40.7 36.3 9.0 17.2 30.2 77.7 52.9 75.5 61.6 66.93
Mixup [zhang2018mixup] 55.8 49.2 46.2 12.8 19.2 36.64 79.2 54.7 77.3 64.7 68.98
MLDG [metadg1] 59.3 51.2 48.8 14.0 20.3 38.72 78.6 54.5 75.9 63.7 68.18
CORAL [dcoral] 58.8 50.8 47.5 13.6 20.8 38.3 77.9 55.3 76.7 64.4 68.58
MMD [DAFL] 54.6 47.5 44.9 12.6 19.6 35.84 78.1 53.7 76.1 63.0 67.73
DANN [ganin2016domain] 53.8 46.7 43.5 11.8 17.5 34.66 76.6 51.7 74.1 59.3 65.43
C-DANN [CDANN] 53.4 46.5 44.7 12.9 18.4 35.18 76.0 51.1 74.1 61.0 65.55
EoA [EoA] 65.9 57.1 55.3 16.5 23.4 43.64 81.5 59.8 79.5 69.1 72.48
SelfReg [kim2021selfreg] 62.4 53.7 51.7 14.7 22.5 41.0 78.8 55.4 78.4 64.9 69.37
SagNet [sagnets] 57.5 49.5 46.3 13.5 19.2 37.2 78.3 54.8 75.8 63.4 68.08
ARM [arm] 49.6 43.9 41.5 10.8 16.5 32.46 75.2 51.0 74.1 58.9 64.8
V-REx [Vrex] 43.3 37.7 32.5 9.8 14.1 27.48 76.6 53.0 75.3 60.7 66.4
MTL [MTL] 58.0 49.0 46.2 12.7 19.2 37.02 76.8 52.4 74.9 61.5 66.4
SAND [SAND_mask] 43.8 39.9 38.2 9.0 15.2 29.22 76.2 53.3 73.5 60.3 65.82
RSC [huangRSC2020] 55.5 47.8 44.4 12.5 18.3 35.7 75.1 51.4 74.8 60.7 65.50
Fishr [rame2021ishr] 58.3 50.5 47.9 13.6 20.2 38.1 78.3 54.4 76.2 62.4 67.83

SWAD [cha2021swad] 66.0 55.5 53.5 16.1 22.4 42.7 80.2 57.7 78.4 66.1 70.6
ViTs AGG 69.14 54.25 58.15 14.83 27.55 44.78 84.64 60.10 84.43 74.2 75.84
MViTs Zero-Shot 67.8 61.79 64.13 13.9 45.7 50.66 84.7 60.8 83.37 78.9 76.94
Linear Eval 63.2 59.37 57.36 10.34 41.7 46.39 82.51 66.66 81.22 72.86 75.81
Attention Eval 75.3 64.68 64.33 16.30 44.23 52.97 88.14 69.00 88.99 77.53 80.92
MViTs + ViTs Distillation 65.23 52.29 55.55 14.06 25.8 42.59 85.08 59.56 83.92 74.04 75.65
Cross Attention 75.14 63.75 64.16 15.80 39.01 51.57 86.67 71.56 88.66 74.20 80.27
Early Fusion 76.75 64.6 65.35 17.1 41.86 53.13 88.76 68.86 88.33 78.68 81.16
INDIGO 76.9 65.65 66.42 17.4 46.32 54.54 89.38 73.31 90.78 79.92 83.35
Table 1: ClosedDG results. Performance of INDIGO on DomainNet (C: clipart, S: sketch, P: painting, Q: quickdraw, I: infograph) and Office-Home (R: real world, C: clipart, P: product, A: art) datasets under closed setting. We highlight the best results and the second best results. The results are averaged over five runs. INDIGO achieves new state-of-the-art by outperforming all compared methods by good margins.

Open Domain Generalization. Shu et al. [shu2021open] introduce OpenDG, a challenging domain generalization setting where each source domain holds a disparate label set. Since the different label sets of distinct source domains cause some classes to be present in more domains than others, the data of minority classes in a few domains lacks diversity. This makes the problem extremely difficult by biasing model representations towards domains rather than content. Hence, we next evaluate the performance of INDIGO on the Office-Home [office-home] and PACS [pacs] datasets under the open setting.

Type Method Office-Home PACS
R C P A Avg. P A C S Avg.
CNNs AGG 62.4 42.83 54.27 42.22 50.43 53.15 51.35 66.43 49.75 55.17
MLDG [metadg1] 62.98 41.82 56.89 42.58 51.07 62.20 44.59 71.64 51.29 45.00
FC [Li2019ICML_FC] 63.79 41.80 54.41 44.13 51.03 60.94 51.12 69.32 51.15 58.13
Epi-FCR [li2019episodic] 62.60 37.13 54.95 46.33 50.25 46.35 54.16 72.00 46.35 60.64
PAR [PAR_dg] 65.98 41.27 55.37 42.40 51.26 51.86 52.97 62.77 53.62 56.56
RSC [huangRSC2020] 60.85 38.60 54.61 44.19 49.56 67.53 50.47 67.51 50.17 58.92
CuMix [mancini2020towards] 64.63 41.54 57.74 42.76 51.67 65.67 53.85 74.16 37.70 57.85
DAML [shu2021open] 65.99 45.13 61.54 53.13 56.45 75.69 43.02 73.65 58.50 65.49
ViTs AGG 76.71 53.76 67.39 65.35 65.80 59.55 63.70 52.15 34.12 52.38
MViTs Zero-Shot 81.2 63.69 82.33 70.8 74.5 99.99 97.87 99.53 87.34 96.18
Linear Eval 52.97 48.07 50.32 47.65 49.75 79.26 82.71 82.29 72.76 79.26
Attention Eval 75.41 72.75 62.83 63.08 68.52 76.92 78.23 80.48 78.17 78.45

MViTs + ViTs Distillation 79.52 63.69 56.37 67.08 66.66 65.10 59.16 53.72 38.52 54.12
Early Fusion 75.77 59.03 70.44 65.89 67.78 77.35 68.59 74.67 61.89 70.62
Cross attention 76.13 67.86 74.27 64.22 70.62 76.52 77.44 87.88 83.29 81.28
INDIGO 83.23 73.25 83.51 67.68 76.91 93.44 93.61 91.08 90.45 92.14
Table 2: OpenDG results. Performance of INDIGO on Office-Home (R: real world, C: clipart, P: product, A: art) and PACS (P: photo, A: art, C: cartoon, S: sketch) datasets under open setting. We highlight the best results and the second best results. The results are averaged over five runs. INDIGO consistently outperforms all compared methods especially on challenging domains like sketch.

(Baselines) Similar to the previous setting, we compare all four kinds of pipelines: (1) CNNs, which include prior art and the current state-of-the-art, DAML [shu2021open], which uses three Resnet-18 backbones (comparable to Resnet-50); (2) ViTs, which include DeiT-S [deit] trained in the AGG manner; (3) MViTs, which include conventional ways of using the pre-trained MViT; and (4) MViTs + ViTs, which include our newly introduced fusion baselines and our proposed fusion, INDIGO (all using a DeiT-S visual backbone). Implementation and architectural details of the fusion module are described in the supplementary material.
(Training and evaluation protocol) Similar to DAML [shu2021open], we consider each domain in turn as the target domain and the remaining domains as source domains for training. We use the training-domain validation model selection criterion for both datasets. We report the accuracy on target-domain data from non-open classes, as in [shu2021open].
(Results) Table 2 presents our results when CLIP-ViT-B/16 [Radford2021LearningTV] is used as the MViT in the multimodal branch. It can be seen that even in challenging settings like OpenDG, where there is a high chance of model representations becoming domain biased, INDIGO achieves state-of-the-art results on the Office-Home dataset. On PACS, even though zero-shot inference works best on average, INDIGO still performs best on the challenging sketch domain (on which all other methods perform worse). Since PACS (under the open setting) is a relatively smaller and less complex dataset (having only six non-open classes) than Office-Home (having 54 non-open classes), we believe this led to overfitting/memorization of the source domain data. This can also be seen with ViTs (trained with the vanilla loss), which significantly outperform the state-of-the-art approach DAML [shu2021open] on Office-Home but overfit on PACS.
Choice of MViT. Apart from CLIP, we also experiment with two other pre-trained MViTs, DeCLIP [li2021supervision] and ALBEF [ALBEF]. DeCLIP uses additional self-, multi-view, and nearest-neighbor supervision along with image-text contrastive supervision to achieve performance similar to CLIP with 7.1× less data. ALBEF, on the other hand, uses momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model. As shown in Table 3, INDIGO still outperforms the conventional and our newly introduced baselines by good margins on the Office-Home dataset under both the closed and open settings. This highlights the efficacy of INDIGO when other pre-trained multimodal networks are used in the multimodal branch. Overall, CLIP, when used in INDIGO, performs best.

Multimodal Method Closed Office-Home Open Office-Home
R C P A Avg. R C P A Avg.
CLIP Zero-Shot 84.7 60.8 83.37 78.9 76.94 81.2 63.69 82.33 70.8 74.5
Linear Eval 82.51 66.66 81.22 72.86 75.81 52.97 48.07 50.32 47.65 49.75
Attention Eval 88.14 69.00 88.99 77.53 80.92 75.41 72.75 62.83 63.08 68.52
Cross-attention 86.67 71.56 88.66 74.20 80.27 76.13 67.86 74.27 64.22 70.62
Early Fusion 88.76 68.86 88.33 78.68 81.16 75.77 59.03 70.44 65.89 67.78
INDIGO 89.38 73.31 90.78 79.92 83.35 83.23 73.25 83.51 67.68 76.91
DeCLIP Linear Eval 34.74 37.92 41.49 33.1 36.8 15.37 8.48 14.87 8.28 11.75
Attention Eval 86.36 70.53 88.45 70.16 78.87 74.51 66.69 70.55 60.10 67.96
Cross-attention 79.48 65.14 81.34 63.20 72.29 56.68 55.57 54.18 46.57 53.25
Early Fusion 87.68 67.23 88.41 73.57 79.22 70.51 59.65 67.90 59.92 64.50
INDIGO 88.61 73.28 90.45 73.05 81.35 83.17 69.70 76.41 62.61 72.97
ALBEF Linear Eval 84.20 70.74 83.19 77.32 78.86 69.70 60.04 65.31 64.04 64.77
Attention Eval 85.86 69.40 86.30 75.85 79.35 74.61 66.45 71.68 62.0 68.68
Cross-attention 86.29 71.15 86.52 73.04 71.75 74.64 65.71 67.5 62.61 67.61
Early Fusion 87.33 69.49 87.27 77.77 80.46 76.55 64.03 73.11 66.48 70.04
INDIGO 87.52 73.42 87.46 78.68 81.77 82.59 71.47 77.79 68.39 75.06
Table 3: Results with other MViTs. Performance of INDIGO with different MViTs on Office-Home (R: real world, C: clipart, P: product, A: art) under closed and open setting. We highlight the best results and the second best results. The results are averaged over five runs. INDIGO consistently achieves best results compared to conventional and newly introduced baseline approaches.

Choice of visual network and number of layers in fusion module. To highlight that INDIGO leverages the visual modality, we perform an ablation in which we vary the strength of the ViT used in the visual branch. Additionally, we vary the number of layers used in the fusion module to show its effect on the final performance. As shown in Table 4, using more powerful (Hybrid-ViT) and larger (ViT-B) vision transformers [vit] in the visual branch improves the domain generalization performance of INDIGO. This shows that INDIGO can attend to the visual modality to learn additional shape-biased concepts, and that the performance is not solely due to the intrinsic modality. The gain in performance becomes more prominent when more layers are used in the fusion module, implying better inter-modality interaction between the intrinsic and visual modality tokens.

Backbone 3 Layers 12 Layers
R C P A Avg. R C P A Avg.
Resnet-50 88.8 72.91 90.1 79.34 82.78 89.0 72.48 90.2 78.2 82.47
DeiT-Ti 89.02 72.88 90.14 80.05 83.02 89.34 73.52 90.43 79.31 83.15
Hybrid-ViT-Ti 89.1 72.77 90.3 79.94 83.02 89.7 73.82 90.50 79.9 83.48
DeiT-S 89.38 73.31 90.78 79.92 83.35 90.1 74.32 90.99 80.2 83.90
Hybrid-ViT-S 89.73 73.71 91.05 81.16 83.91 90.88 75.45 91.22 81.62 84.80
ViT-B 91.4 74.23 91.84 82.33 84.95 91.76 75.85 92.13 83.51 85.81
Table 4: Ablation on choice of visual network and number of layers in fusion module. Performance of INDIGO when different networks are used in visual branch and layers of fusion module are increased on Office-Home (R: real world, C: clipart, P: product, A: art) under closed setting. The results are averaged over five runs. Stronger and Larger ViTs can be seen to further improve the generalization of INDIGO to unseen domains.

Choice of fusion mechanism. We analyze how different fusion mechanisms perform compared to the multi-head self-attention (MSA) blocks [vit] that we currently use in our fusion module. In Table 5, we compare (1) a naive fusion mechanism, concatenation (which simply concatenates the intrinsic and visual modality tokens); (2) multi-head self-attention (MSA) [vit]; (3) multi-head cross-attention (MCA) [chen2021crossvit]; and (4) MLP-Mixer [tolstikhin2021mlpmixer]. We observe that MCA performs slightly better than MSA, whereas MLP-Mixer and concatenation perform worse. This shows that the choice of fusion mechanism can affect the overall performance, which we leave for future work to explore.
Can fine-tuning the MViT help further? In all our previous experiments, we used a frozen pre-trained multimodal network. As an additional experiment, along with training the visual branch and the fusion module, we also finetune the multimodal network (i.e., CLIP). We finetune in two ways: (1) only the normalization layers; and (2) the last layer. Table 6 shows that the performance of INDIGO improves further while still outperforming standalone finetuning of CLIP.

Method Closed Office-Home
R C P A Avg.
CLIP (FT Norm Layers) 84.00 68.92 84.13 76.47 78.38
INDIGO (FT Norm Layers) 89.50 73.50 91.02 80.2 83.55
CLIP (FT Last Layer) 89.6 74.5 89.70 83.43 84.30
INDIGO (FT Last Layer) 90.83 76.87 91.64 83.91 85.81
Table 6: Ablation on finetuning the MViT. Performance of INDIGO when the MViT (CLIP) is also finetuned, on Office-Home (R: real world, C: clipart, P: product, A: art) under the closed setting. We use the training-domain validation set model selection criterion. The results are averaged over five runs.
Method Closed Office-Home
R C P A Avg.
Concatenation 83.37 60.78 84.79 75.06 76.0
MSA 89.38 73.31 90.78 79.92 83.35
MCA 89.87 73.30 90.71 80.03 83.47
MLP-Mixer 88.93 72.52 90.23 78.75 82.60
Table 5: Ablation on choice of fusion mechanism. Performance of INDIGO when different fusion mechanisms are used in the fusion module on Office-Home (R: real world, C: clipart, P: product, A: art) under the closed setting. The results are averaged over five runs.

t-SNE plots. We analyze and compare the representations learned by INDIGO with those of DeiT-S [deit] and CLIP [Radford2021LearningTV] on the target domain (for 25 classes of Office-Home) via the t-SNE plots in Figure 4. As can be seen, the plot for INDIGO is less noisy and better segregated into class clusters than those for DeiT-S and CLIP, consistent with its state-of-the-art generalization on these target domains.

(a)
(b)
Figure 4: (a) t-SNE plots. t-SNE visualization of the feature representations learned by DeiT-S (standard AGG training), CLIP, and our proposed INDIGO method when clipart and art are chosen as target domains for the Office-Home dataset. (b) Limited sources DG. Performance of INDIGO when trained only on real world as the source domain and evaluated on clipart, art, and product as unseen target domains for the Office-Home dataset. (Best viewed in color, zoomed in.)

DG with limited sources. Generalization to unseen domains can become challenging when data from only a few source domains is available. To test INDIGO under such a challenging scenario, we experiment on Office-Home with real world as the only source domain and the remaining domains (clipart, art, product) as target domains. In Figure 4, we can observe that INDIGO still yields the best average performance when compared with zero-shot CLIP, attention eval (on frozen CLIP features), and the Early Fusion baseline.
Attention maps. Similar to [naseer2021intriguing], in Figure 5 we analyze and compare the attention maps of a DeiT-S trained in the vanilla (AGG) fashion with those of the one used in INDIGO's visual branch. We observe that, when used in the INDIGO pipeline, the vision transformer concentrates on foreground objects in the scene and better ignores the background or style. This confirms that attending to the intrinsic modality in the fusion module helps the vision transformer exhibit more shape bias than a vanilla one. The experiment is performed on DomainNet with quickdraw, sketch, and clipart as target domains separately.

Figure 5: Attention maps for DeiT-S on the harder quickdraw and sketch domains and the relatively simpler clipart domain, trained with standard (AGG) training and with our proposed INDIGO method. The best attention heads are depicted for both approaches.

5 Conclusions

In this work, we study how the multimodal information present in pre-trained vision-language models can be leveraged "intrinsically" to build systems that generalize to unseen domains. We propose INDIGO, a simple and elegant way to combine the intrinsic and visual modalities obtained from a pre-trained multimodal network and a vision transformer (ViT), respectively. We conduct extensive experiments to demonstrate the effectiveness of the approach in generalizing to unseen domains under the closed, open, and limited-sources settings. We then conduct a thorough analysis to characterize the efficacy of our approach in leveraging both the intrinsic and visual modalities. Our future work will include the development of better methods to fuse both modalities effectively to further improve generalization performance on unseen domains. We also plan to extend and explore the significance of our work in other challenging settings like OOD generalization, data-free domain generalization, zero-shot domain generalization, domain-generalized semantic segmentation, and visual grounding.

References