
Multi-modal Alignment using Representation Codebook

Aligning signals from different modalities is an important step in vision-language representation learning, as it affects the performance of later stages such as cross-modality fusion. Since image and text typically reside in different regions of the feature space, directly aligning them at the instance level is challenging, especially when features are still evolving during training. In this paper, we propose to align at a higher and more stable level using cluster representations. Specifically, we treat image and text as two "views" of the same entity, and encode them into a joint vision-language coding space spanned by a dictionary of cluster centers (codebook). We contrast positive and negative samples via their cluster assignments while simultaneously optimizing the cluster centers. To further smooth out the learning process, we adopt a teacher-student distillation paradigm, where the momentum teacher of one view guides the student learning of the other. We evaluate our approach on common vision-language benchmarks and obtain new state-of-the-art results on zero-shot cross-modality retrieval while being competitive on various other transfer tasks.



1 Introduction

Vision language (V&L) representation learning is the problem of learning a unified feature embedding using both image and text signals. Pretrained V&L models have a great diversity of applications in downstream tasks across different settings, e.g., via transfer learning [chen2020uniter, li2020oscar, zhang2021vinvl]. The main tasks in V&L pretraining include aligning the feature spaces of different modalities (multi-modal alignment [lu2019vilbert, chen2020uniter, li2020oscar, li2021align]) and capturing the interaction across modalities (cross-modal fusion [vaswani2017attention, dosovitskiy2020image]). Late-fusion approaches such as CLIP [radford2021learning] and ALIGN [jia2021scaling] focus on the first task, while early-fusion approaches such as OSCAR [li2020oscar], VinVL [zhang2021vinvl] and ViLT [kim2021vilt] focus on the second. In this work, we adopt a hybrid approach similar to ALBEF [li2021align], where features from the image and text modalities are first aligned and then fused using a transformer encoder. The main focus of our work is the feature alignment stage, which is challenging because image and text inputs have very different characteristics. Existing approaches such as CLIP [radford2021learning] and ALIGN [jia2021scaling] rely on large training resources and massive amounts of data to obtain good alignments (400M and 1.8B image-text pairs, respectively).

Figure 1: We propose to use a learnable codebook to better align the image and text modalities. The codebook serves as a “bridge” between the image and text features. Each codeword can be interpreted as a prototype, which enables contrasting image and text at the cluster level. We then solve an optimal transport [ambrosio2008gradient] problem to optimize the distance between each modality and the prototypes, which in turn optimizes the alignment between the two modalities. Prototype vectors are learned along with the feature encoders in our V&L framework.

In this work, we propose a more efficient alignment strategy that uses a codebook to quantize the common text-image feature space into codewords. These codewords, or cluster centers, provide a more stable basis for contrastive reasoning than individual text or visual features. We take inspiration from SwAV [caron2020unsupervised], which was developed for self-supervised visual representation learning. In [caron2020unsupervised], two augmented versions (views) of the same input image are passed through a deep network for feature extraction. The visual embedding is learned by optimizing an objective that enforces consistency between the feature from one view and the assigned cluster from the other view. SwAV achieved impressive performance on various transfer tasks (see [caron2020unsupervised]). Here, we carry out contrastive reasoning across modalities (image-text) instead of across image views. Details are in Section 3.1, but in a nutshell, we use a learnable codebook shared by the image and text modalities and train our model to predict the codeword assignment using either text or visual information. Effectively, visual and text features are lined up by aligning each with the common codewords during training. See Figure 1 for an illustration.

The codebook can be considered a quantized sample of the underlying output feature distribution. It is end-to-end learnable together with the model parameters. To avoid abrupt changes during training, we further employ momentum distillation, which has been widely used in previous self-supervised learning works such as BYOL [grill2020bootstrap], DINO [caron2021emerging], and MoCo [he2020momentum]. In brief, similar to ALBEF [li2021align], for each of the image, text and fusion encoders there is a corresponding encoder that is updated through a moving average without gradient back-propagation. These momentum encoders serve as teachers to guide the self-supervised learning process. Different from ALBEF [li2021align], we use the teachers to guide codebook learning as well as cross-modal and intra-modal alignment.

The above two components work together to support a stable update of the codebook, which in turn provides an efficient regularization means for cross-modality alignment. Experimental results (Section 4) show that our approach is competitive with the state of the art across various benchmarks, even when compared with approaches that use massive amounts of data such as CLIP [radford2021learning] and ALIGN [jia2021scaling]. In summary, our main contributions are as follows:

  • We propose a codebook-based approach for efficient vision-language alignment learning. It is an extension of self-supervised vision representation learning (SSL) to the multimodal setting.

  • We introduce a new distillation algorithm that aids unimodal and cross-modal contrastive optimization and helps stabilize codebook learning.

The rest of the paper is organized as follows. We introduce related work to ours in Section 2. In Section 3, we describe our framework, called Codebook Learning with Distillation (CODIS), and its two components, multimodal codebook learning and teacher-student distillation. Experimental results are presented in Section 4. Section 5 concludes the paper.

2 Related Work

Vision-Language Pre-training (V&L) V&L pretraining is an active research area with many recent works; we review those most relevant to ours. Architecture-wise, previous approaches can be broadly classified into two categories: early fusion and late fusion. In early-fusion approaches [su2019vl, kim2021vilt, chen2020uniter, li2020oscar], image and text are transformed into sequences (tokenization) and passed to a single encoder (typically Transformer-based) for embedding generation; multimodal signals are thus fused at an early stage. In late-fusion works [radford2021learning, jia2021scaling], separate encoders are used for image and text, and the extracted features are typically fused during the later fine-tuning stage. Our work is a hybrid between these two approaches, similar to ALBEF [li2021align]. The main difference between ALBEF and ours is the codebook and the related contrastive losses.

In vision-language learning, codebooks have been used in a number of recent works, mostly for image tokenization. BEiT [bao2021beit] constructed a dictionary of visual words, then used it to form a masked image modeling task in the same fashion as masked language modeling. SOHO [huang2021seeing] integrated a visual dictionary into the main model and jointly trained both. Both works quantized the visual input space. In our work, the codebook quantizes the joint output space, where multimodal views are aligned via optimal transport [ambrosio2008gradient]. Other concurrent works include [li2020unimo, li2021align], which align cross-modal instances using InfoNCE [oord2018representation]. In contrast, we enforce both unimodal and cross-modal alignment, at both the instance level and the cluster level.

Self-supervised Contrastive Learning The goal of contrastive learning [hadsell2006dimensionality] is to attract positive sample pairs and repulse negative sample pairs. Recently, it has been widely used in computer vision for unsupervised and self-supervised representation learning [he2020momentum, chen2020simple, caron2021emerging]. Contrastive reasoning is typically formed from two augmented views of the same input image. One of the main challenges is feature collapse, and in practice a large number of negative samples, obtained through either a large batch size [chen2020simple] or memory banks [he2020momentum, wu2018unsupervised], is required to alleviate this problem. Several recent works have shown that one can learn unsupervised features without discriminating instances. Deep clustering [caron2018deep] and SwAV [caron2020unsupervised] incorporate online clustering into Siamese networks. In BYOL [grill2020bootstrap], features are trained by matching them to representations obtained with a momentum encoder. DINO [caron2021emerging] instantiates the momentum encoder with a vision transformer and adopts a teacher-student distillation paradigm [hinton2015distilling, xie2020self, duan2021slade]. Our alignment techniques and momentum update were inspired by these works and can be considered extensions to the multimodal setting.

Figure 2: Overview of our framework. For simplicity, we only display a pair of teacher-student encoders (e.g., teacher for the image and student for the text) and similarly for the memory queue. The teacher is updated with an exponential moving average of the student (from the same modality). The codebook helps bridge the gap between the different modalities. The entire framework is end-to-end optimized.


3 Method

Our goal is to learn an explicit alignment between image and text features to facilitate multimodal interactions. We illustrate CODIS in Figure 2 and provide a pseudo-code implementation in Algorithm 1. It shares some similarities with self-supervised contrastive learning [he2020momentum, caron2020unsupervised]. We treat the image and text modalities as two views and adopt a teacher-student distillation paradigm [grill2020bootstrap, caron2021emerging] to enforce unimodal and cross-modal alignment. To overcome the gap between multimodal distributions, we also learn a codebook, which serves as a bridge to help align features across modalities. This section is organized as follows.

In Section 3.1, we present multimodal codebook learning: how it is optimized and how to leverage it to resolve the distribution mismatch between multimodal inputs. In Section 3.2, we introduce how to achieve unimodal and cross-modal alignment under the teacher-student distillation formulation. Finally, we explain how the two proposed components integrate into the V&L framework in Section 3.3.

# gs, gt: student/teacher networks for image
# fs, ft: student/teacher networks for text
# C: codebook, d-by-K
# Qv, Qt: image/text queues, d-by-M
# tmp: learnable temperature
for (img, txt) in loader: # a minibatch with N samples
    # teacher/student's image view
    img_t, img_s = gt(img), gs(img) # N-by-d
    # teacher/student's text view
    txt_t, txt_s = ft(txt), fs(txt) # N-by-d
    # calculate codebook loss
    I2P, T2P = img_t@C, txt_t@C # N-by-K similarities to prototypes
    Tg, Tf = IPOT(1-I2P), IPOT(1-T2P) # transport plans, refer to Algo 2
    L_ot = Trace(I2P.t()@Tg) + Trace(T2P.t()@Tf)
    L_code = H(img_s@C, Tg) + H(txt_s@C, Tf) + L_ot
    # calculate alignment loss
    L_cross = H(img_s@Qt, img_t@Qt) + H(txt_s@Qv, txt_t@Qv)
    L_unimo = H(img_s@Qv, img_t@Qv) + H(txt_s@Qt, txt_t@Qt)
    L_align = L_cross + L_unimo
    # enqueue/dequeue
    update_queue(Qv, img_t, Qt, txt_t)
    # pretraining loss
    L_pretrain = L_itm + L_mlm
    loss = L_code + L_align + L_pretrain
    loss.backward() # back-propagate
    # student, teacher updates
    update(gs, fs) # SGD
    ema(gs, gt, fs, ft) # momentum update
def H(s, t):
    t = t.detach() # stop gradient
    s = softmax(s / tmp, dim=1)
    return - (t * log(s)).sum(dim=1).mean()
Algorithm 1 CODIS pseudocode
1:  Input: cost matrix $\mathbf{C}_m \in \mathbb{R}^{N \times K}$, proximal weight $\beta$, probability vectors $\mathbf{a} = \frac{1}{N}\mathbf{1}_N$, $\mathbf{b} = \frac{1}{K}\mathbf{1}_K$
2:  $\boldsymbol{\sigma} \leftarrow \frac{1}{K}\mathbf{1}_K$, $\mathbf{T} \leftarrow \mathbf{1}_N\mathbf{1}_K^{\top}$
3:  $\mathbf{A} \leftarrow \exp(-\mathbf{C}_m/\beta)$
4:  for $t = 1, 2, \dots, N_{outer}$ do
5:      $\mathbf{Q} \leftarrow \mathbf{A} \odot \mathbf{T}$ // $\odot$ is Hadamard product
6:      for $k = 1, 2, \dots, N_{inner}$ do
7:          $\boldsymbol{\delta} \leftarrow \mathbf{a}/(\mathbf{Q}\boldsymbol{\sigma})$, $\boldsymbol{\sigma} \leftarrow \mathbf{b}/(\mathbf{Q}^{\top}\boldsymbol{\delta})$
8:      end for
9:      $\mathbf{T} \leftarrow \mathrm{diag}(\boldsymbol{\delta})\,\mathbf{Q}\,\mathrm{diag}(\boldsymbol{\sigma})$
10:  end for
11:  Return $\mathbf{T}$
Algorithm 2 IPOT Algorithm.

3.1 Multimodal Codebook Learning

We propose to learn a codebook to facilitate aligning multimodal semantics. The codebook is a collection of learnable prototypes, or codewords; we use the two terms interchangeably in this paper. With the codebook, we encode image and text into a joint vision-language embedding space and learn the alignment by contrasting their prototype assignments. The codebook can also be interpreted as a quantized sample of the underlying feature distribution of the paired data [chen2020graph]. In this way, by aligning features from each modality with the codebook, we implicitly align the multimodal features. In other words, the codebook serves as a “bridge” between the modalities (see Figure 1).

We denote the learnable codebook as $\mathbf{C} \in \mathbb{R}^{d \times K}$, where $d$ is the dimension of each code and $K$ is the number of codewords. We set $d$ equal to the dimension of the projected image/text features. Each column $\mathbf{c}_k$ is a prototype.
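To make the shapes concrete, here is a toy NumPy sketch of the codebook and the feature-to-prototype similarities; the sizes `d`, `K`, `N` are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, N = 8, 16, 4                               # illustrative sizes
C = rng.standard_normal((d, K))
C /= np.linalg.norm(C, axis=0, keepdims=True)    # unit-norm prototypes (columns)
Z = rng.standard_normal((N, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)    # unit-norm projected features (rows)
sim = Z @ C                                      # N-by-K feature-to-prototype similarity
cost = 1.0 - sim                                 # cost matrix fed to the OT solver
```

With unit-norm rows and columns, each entry of `sim` is a cosine similarity, so `cost` is non-negative and small exactly where a feature is close to a prototype.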

Given image or text feature vectors $\mathbf{Z}^t \in \mathbb{R}^{N \times d}$ (the superscript $t$ denotes features extracted from the momentum teacher encoder), we compute an optimal cost mapping from the feature vectors to the prototypes. We denote such a mapping as a transport plan $\mathbf{T} \in \mathbb{R}_{+}^{N \times K}$, obtained using Optimal Transport [ambrosio2008gradient, chen2020graph]. Without loss of generality, we denote $\mathbf{Z}$ as the projected features for either image or text and optimize the following objective,

$$\mathcal{L}_{ot} = \min_{\mathbf{T} \in \Pi(\mathbf{a}, \mathbf{b})} \langle \mathbf{T}, \mathbf{C}_m \rangle_F, \qquad \Pi(\mathbf{a}, \mathbf{b}) = \left\{ \mathbf{T} \in \mathbb{R}_{+}^{N \times K} \mid \mathbf{T}\mathbf{1}_K = \mathbf{a},\ \mathbf{T}^{\top}\mathbf{1}_N = \mathbf{b} \right\},$$

where $\mathbf{a} = \frac{1}{N}\mathbf{1}_N$, $\mathbf{b} = \frac{1}{K}\mathbf{1}_K$, and $\mathbf{1}_N$ denotes an $N$-dimensional all-one vector. $\mathbf{C}_m$ is the cost matrix given by $\mathbf{C}_m = 1 - \mathbf{Z}\mathbf{C}$, and $\langle \cdot, \cdot \rangle_F$ represents the Frobenius dot-product. We use Tg and Tf for the optimal transport plans for image and text in Algorithm 1; $1 - \texttt{I2P}$ corresponds to the cost matrix for the image modality, and similarly for text.

To solve for the optimal transport plan, we adopt the iterative IPOT procedure shown in Algorithm 2. It takes the cost matrix computed from the normalized feature matrix $\mathbf{Z}$ and the codebook $\mathbf{C}$ as input and outputs an optimal transport plan $\mathbf{T}$. Internally, the algorithm minimizes the optimal transport (OT) distance, which encourages picking a similar prototype $\mathbf{c}_k$ for each feature $\mathbf{z}_i$ based on the score $\mathbf{z}_i^{\top}\mathbf{C}$ (the $i$-th row of $\mathbf{Z}\mathbf{C}$). In other words, $\mathbf{T}$ can be viewed as a distance metric between prototypes and features. When solved, OT yields a sparse solution containing at most $N + K - 1$ non-zero elements, leading to a robust and meaningful alignment [de2011optimal].
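The IPOT iterations of Algorithm 2 can be sketched in NumPy as follows; the proximal weight `beta` and the iteration counts are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def ipot(cost, beta=0.5, n_outer=100, n_inner=1):
    """Inexact proximal point method for optimal transport (IPOT) -- a sketch.

    cost: (N, K) cost matrix, e.g. 1 - feature/prototype similarity.
    Returns an (N, K) transport plan with (approximately) uniform marginals.
    """
    N, K = cost.shape
    a = np.full(N, 1.0 / N)              # row marginal (features)
    b = np.full(K, 1.0 / K)              # column marginal (prototypes)
    sigma = np.full(K, 1.0 / K)
    T = np.ones((N, K))
    A = np.exp(-cost / beta)             # Gibbs kernel of the proximal step
    for _ in range(n_outer):
        Q = A * T                        # Hadamard product
        for _ in range(n_inner):         # Sinkhorn-style scaling updates
            delta = a / (Q @ sigma)
            sigma = b / (Q.T @ delta)
        T = delta[:, None] * Q * sigma[None, :]
    return T
```

Each row of the returned plan indicates how strongly a feature is assigned to each prototype; in Algorithm 1 the plan computed from teacher features serves as the target for the student's feature-to-prototype similarities.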

In the codebook loss that we now formulate, $\mathbf{T}$ will be used as the ground-truth signal to guide the feature-to-prototype alignment. We use a cross-entropy loss and adopt a teacher-student distillation approach to construct the loss for optimizing the codebook as well as the feature encoders,

$$\mathcal{L}_{proto} = H(\mathbf{Z}^s_{img}\mathbf{C}, \mathbf{T}_g) + H(\mathbf{Z}^s_{txt}\mathbf{C}, \mathbf{T}_f),$$

where the predicted metric $\mathbf{Z}^s\mathbf{C}$ is calculated with the features from the student encoders, while $\mathbf{T}_g$ and $\mathbf{T}_f$ are calculated with features from the teacher encoders using Algorithm 2. The reason is that the teacher encoders are updated via exponential moving average, which helps avoid abrupt changes in codebook learning.
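The distillation cross entropy $H(\cdot, \cdot)$ used here (and in Algorithm 1) can be sketched as follows; the temperature value is an illustrative assumption:

```python
import numpy as np

def soft_cross_entropy(student_scores, targets, tmp=0.1):
    """H(s, t): cross entropy between temperature-softened student
    predictions and fixed (teacher-derived) soft targets.

    student_scores: (N, K) raw similarity scores, e.g. Z_s @ C.
    targets: (N, K) soft targets, e.g. transport-plan rows normalized to
             sum to one; treated as constants (stop-gradient on the teacher).
    """
    s = student_scores / tmp
    s = s - s.max(axis=1, keepdims=True)                   # numerical stability
    log_p = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    return float(-(targets * log_p).sum(axis=1).mean())
```

The loss is minimized when the student's softened assignment distribution matches the teacher-derived target for every sample.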

We additionally add the OT distance $\mathcal{L}_{ot}$ as a regularization term. The overall loss for multimodal codebook learning is as follows,

$$\mathcal{L}_{code} = H(\mathbf{Z}^s_{img}\mathbf{C}, \mathbf{T}_g) + H(\mathbf{Z}^s_{txt}\mathbf{C}, \mathbf{T}_f) + \mathcal{L}_{ot}.$$
As shown in Figure 3, the codebook acts as a bridge between the image and text modalities, as both the text-to-prototype loss and the image-to-prototype loss chain features from both modalities. For example, the text-to-prototype loss chains the image-prototype transport plan with the text-prototype similarity, and vice versa. More importantly, learning the codebook allows contrasting features across modalities at the prototype level, i.e., feature distribution matching. When calculating the transport plan, we use the teacher features as they provide a more stable supervision signal to guide the learning of the student. The calculated losses are backpropagated to update both the codebook and the student encoders.

Figure 3: Diagram illustrating how the four codebook losses are calculated; the symbols in the figure denote the softmax operator, the IPOT algorithm, the OT loss, and the cross entropy, respectively.

3.2 Teacher-student Distillation Learning

This loss is designed to align the features from the two unimodal encoders, inspired by the recent success of SSL [he2020momentum, caron2021emerging]. Our motivation is that image and text can be treated as two “views” of the same entity, and we adopt a teacher-student distillation paradigm to align them. Since the raw features from the unimodal encoders lie in different feature spaces, we learn a joint embedding space of dimension $d$ for the image and text student features. Following [he2020momentum, li2021align], we store features from the teacher encoders in memory queues $\mathbf{Q}_v$ and $\mathbf{Q}_t$ for image and text, respectively.
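A minimal sketch of the MoCo-style memory queue update (the FIFO policy and the function name follow `update_queue` in Algorithm 1; the array layout is an assumption):

```python
import numpy as np

def update_queue(queue, new_feats):
    """FIFO memory queue update: drop the oldest N columns and append the
    newest N teacher features.

    queue: (d, M) array of stored teacher features.
    new_feats: (N, d) batch of fresh teacher features.
    """
    N = new_feats.shape[0]
    return np.concatenate([queue[:, N:], new_feats.T], axis=1)
```

Keeping a long queue of teacher features supplies many negatives for the similarity distributions below without requiring a large batch size.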

For a pair of image and text features $(\mathbf{z}_v, \mathbf{z}_t)$, we calculate the cross-modal and intra-modal similarities as follows:

$$p^{v2t} = \mathrm{softmax}(\mathbf{z}_v^{\top}\mathbf{Q}_t/\tau), \quad p^{t2v} = \mathrm{softmax}(\mathbf{z}_t^{\top}\mathbf{Q}_v/\tau), \quad p^{v2v} = \mathrm{softmax}(\mathbf{z}_v^{\top}\mathbf{Q}_v/\tau), \quad p^{t2t} = \mathrm{softmax}(\mathbf{z}_t^{\top}\mathbf{Q}_t/\tau),$$

where $\tau$ is the temperature and the pseudo image negatives for estimating $p^{t2v}$ are sampled from the image queue $\mathbf{Q}_v$, and similarly for $p^{v2t}$. In addition to [li2021align], we also consider unimodal (intra) alignment. Intuitively, enhancing the unimodal feature representation lays a better foundation for cross-modal alignment.

To further smooth out the learning process, we use the features from the momentum teacher to provide the soft distillation targets (refer to Algorithm 1 for details). The loss for intra/cross-modal alignment is defined as,

$$\mathcal{L}_{align} = \underbrace{H(p^{v2t}_s, p^{v2t}_t) + H(p^{t2v}_s, p^{t2v}_t)}_{\mathcal{L}_{cross}} + \underbrace{H(p^{v2v}_s, p^{v2v}_t) + H(p^{t2t}_s, p^{t2t}_t)}_{\mathcal{L}_{unimo}},$$

where $H$ is the cross entropy and the subscripts $s$ and $t$ indicate whether the similarity is computed with student or teacher features. This objective can also be viewed as knowledge distillation, between teacher and student encoders of the same modality (i.e., $\mathcal{L}_{unimo}$) as well as between teacher and student encoders of different modalities (i.e., $\mathcal{L}_{cross}$). The parameters of a teacher encoder are an exponential moving average of the student's, detached from the gradient update. We adopt a momentum update similar to [he2020momentum] to update the teacher encoders:

$$\theta_{teacher} \leftarrow m\,\theta_{teacher} + (1 - m)\,\theta_{student},$$

where $m$ is the momentum parameter. In practice, we set $m$ close to 1 in order to smoothly update the teacher encoders.
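The momentum (EMA) update above can be sketched as follows; the default momentum value is an illustrative assumption, not the paper's setting:

```python
def ema_update(teacher, student, m=0.995):
    """theta_teacher <- m * theta_teacher + (1 - m) * theta_student.

    teacher, student: lists of parameter values (floats or arrays).
    Returns the updated teacher parameters; no gradient flows through them.
    """
    return [m * t + (1.0 - m) * s for t, s in zip(teacher, student)]
```

With `m` close to 1, the teacher drifts slowly toward the student, which keeps the distillation targets stable between iterations.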

3.3 Self-supervised Pre-training

In this section, we first introduce two commonly used objectives for multimodal training frameworks: (i) the masked language modeling (MLM) loss and (ii) image-text matching (ITM) on the multimodal encoder. Then we discuss how the codebook and teacher-student distillation components are integrated. We denote the image and text features extracted by the student networks as $\{v_{cls}, v_1, \dots, v_n\}$ and $\{t_{cls}, t_1, \dots, t_m\}$, respectively. Specifically, $v_{cls}$ is the image [CLS] token and $\{v_i\}$ are the image patch embeddings; similarly, $t_{cls}$ indicates the text [CLS] token and $\{t_i\}$ are the word embeddings.

3.3.1 Image-Text Matching (ITM) Loss

To fuse vision and language representations, we adopt ITM that is widely used in modern V&L frameworks. Given an arbitrary pair of image and text, ITM predicts whether they are aligned (positive pairs) or not (negative pairs). This procedure can be formulated as a binary classification problem.

Specifically, the [CLS] token from the fusion encoder is used as the joint representation of the image-text pair. The ITM head is a fully connected layer that predicts the matching probability $p^{itm}$. We assume that each image-text pair sampled from the pre-training datasets is a positive example and construct negative examples with the following strategy: for each image within the batch, we sample one negative text from the same batch based on the contrastive similarity distribution, so that a text more similar to the image has a higher chance of being sampled. Similarly, one hard negative image is sampled for each text. We denote by $y^{itm}$ the ground-truth label indicating whether the image-text pair is positive or negative,

$$\mathcal{L}_{itm} = \mathbb{E}_{(I, T)}\, H(p^{itm}(I, T),\, y^{itm}),$$

where $H$ is the cross entropy operator.
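The hard-negative sampling strategy can be sketched as follows; the function name and array layout are assumptions consistent with the description above:

```python
import numpy as np

def sample_itm_negatives(sim, rng):
    """For each image i, sample one negative text index j != i with
    probability proportional to exp(similarity), so texts more similar to
    the image are more likely to be chosen as hard negatives.

    sim: (N, N) image-to-text contrastive similarity matrix for a batch,
         where entry (i, i) is the matched (positive) pair.
    """
    N = sim.shape[0]
    negatives = []
    for i in range(N):
        p = np.exp(sim[i] - sim[i].max())
        p[i] = 0.0                       # exclude the matched (positive) text
        p /= p.sum()
        negatives.append(int(rng.choice(N, p=p)))
    return negatives
```

Sampling in proportion to similarity yields harder negatives than uniform sampling, which makes the binary ITM classifier more discriminative.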

3.3.2 Masked Language Modeling (MLM) Loss

We follow the design of the MLM loss from BERT [devlin2018bert], which aims to predict the ground-truth labels $y^{msk}$ of masked text tokens. Specifically, we randomly mask out 15% of the input text tokens and replace them with the special token [MASK]. Different from BERT, our MLM loss is conditioned on both the surrounding text tokens and the image representations. Denoting the predicted token probability by $p^{msk}(I, \hat{T})$, we construct the loss objective as follows,

$$\mathcal{L}_{mlm} = \mathbb{E}_{(I, \hat{T})}\, H(p^{msk}(I, \hat{T}),\, y^{msk}),$$

where $\hat{T}$ is the text token sequence after masking.
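The masking step can be sketched as follows (a simplified sketch: all selected tokens are replaced with [MASK], per the description above, rather than BERT's 80/10/10 scheme):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace ~15% of tokens with [MASK]; returns the masked
    sequence and per-position labels (the original token at masked
    positions, None elsewhere -- only masked positions are predicted)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

The model then predicts each masked position from the unmasked text context together with the image features.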

3.4 Summary

We simultaneously optimize the codebook and the student encoders within the framework in an end-to-end manner, employing the losses discussed in the previous sections:

$$\mathcal{L} = \mathcal{L}_{code} + \mathcal{L}_{align} + \mathcal{L}_{mlm} + \mathcal{L}_{itm},$$

among which the MLM and ITM losses have been widely used in many V&L methods, particularly the “early-fusion” frameworks. The intra-cross alignment loss is the main objective for “late-fusion” V&L frameworks. CODIS combines the merits of both “early-fusion” and “late-fusion” approaches by explicitly learning alignment along with fusion.

The intra-cross alignment loss ($\mathcal{L}_{align}$) described in Section 3.2 can be viewed as an instance-to-instance alignment loss, similar to the one in [li2021align]. The difference is that we consider both intra- and cross-modal alignment; we assume that a stronger unimodal representation lays a solid foundation for the cross-modal representation, and empirical evidence is provided in Section 4.4. The codebook loss ($\mathcal{L}_{code}$) designed in Section 3.1 measures the distance between the transport plan and the similarity matrix. It contrasts features at the prototype level and can be interpreted as distance metric matching [caron2018deep, chen2020graph]. Combining the two helps avoid the prototype collapsing problem, as online prototype clustering requires careful tuning [caron2020unsupervised]. Finally, the supervision signals for both the intra-cross alignment loss and the codebook loss require features from the momentum teacher, for which we adopt a teacher-student distillation approach. This can be seen as a generalization of unimodal SSL to the multimodal setting under the V&L framework.

4 Experiments

To evaluate our approach, we conduct extensive studies on commonly used benchmarks and present experimental comparisons against state-of-the-art V&L methods. We follow previous experimental protocols [chen2020uniter, li2021align] for fair comparison. We use COCO [lin2014microsoft], Visual Genome (VG) [krishna2017visual], Conceptual Captions (CC) [sharma2018conceptual], and SBU Captions [ordonez2011im2text] as the pre-training datasets in our study, covering a total of 4.0M unique images and 5.1M image-text pairs.

Method                        MSCOCO (5K)                              Flickr30K (1K)
                              Text Retrieval      Image Retrieval      Text Retrieval      Image Retrieval
                              R@1   R@5   R@10    R@1   R@5   R@10     R@1   R@5   R@10    R@1   R@5   R@10
ImageBERT [qi2020imagebert]   44.0  71.2  80.4    32.3  59.0  70.2     70.7  90.2  94.0    54.3  79.6  87.5
Unicoder-VL [li2020unicoder]  -     -     -       -     -     -        64.3  85.8  92.3    48.4  76.0  85.2
UNITER [chen2020uniter]       -     -     -       -     -     -        80.7  95.7  98.0    66.2  88.4  92.9
ViLT [kim2021vilt]            56.5  82.6  89.6    40.4  70.0  81.1     73.2  93.6  96.5    55.0  82.5  89.8
CLIP [radford2021learning]    58.4  81.5  88.1    37.8  62.4  72.2     88.0  98.7  99.4    68.7  90.6  95.2
ALIGN [jia2021scaling]        58.6  83.0  89.7    45.6  69.8  78.6     88.6  98.7  99.7    75.7  93.8  96.8
ALBEF 4M [li2021align]        68.6  89.5  94.7    50.1  76.4  84.5     90.5  98.8  99.7    76.8  93.7  96.7
Ours                          71.5  91.1  95.5    53.9  79.5  87.1     91.7  99.3  99.8    79.7  94.8  97.3

Table 1: Performance comparison of zero-shot image-text retrieval on MSCOCO and Flickr30K datasets.
Method                        MSCOCO (5K)                              Flickr30K (1K)
                              Text Retrieval      Image Retrieval      Text Retrieval      Image Retrieval
                              R@1   R@5   R@10    R@1   R@5   R@10     R@1   R@5   R@10    R@1   R@5   R@10
ImageBERT [qi2020imagebert]   66.4  89.8  94.4    50.5  78.7  87.1     87.0  97.6  99.2    73.1  92.6  96.0
UNITER [chen2020uniter]       65.7  88.6  93.8    52.9  79.9  88.0     87.3  98.0  99.2    75.6  94.1  96.8
VILLA [gan2020large]          -     -     -       -     -     -        87.9  97.5  98.8    76.3  94.2  96.8
OSCAR [li2020oscar]           70.0  91.1  95.5    54.0  80.8  88.5     -     -     -       -     -     -
ViLT [kim2021vilt]            61.5  86.3  92.7    42.7  72.9  83.1     83.5  96.7  98.6    64.4  88.7  93.8
UNIMO [li2020unimo]           -     -     -       -     -     -        89.7  98.4  99.1    74.6  93.4  96.0
SOHO [huang2021seeing]        66.4  88.2  93.8    50.6  78.0  86.7     86.5  98.1  99.3    72.5  92.7  96.1
ALBEF 4M [li2021align]        73.1  91.4  96.0    56.8  81.5  89.2     94.3  99.4  99.8    82.8  96.7  98.4
Ours                          75.3  92.6  96.6    58.7  82.8  89.7     95.1  99.4  99.9    83.3  96.1  97.8

Table 2: Performance comparison of fine-tuned image-text retrieval on MSCOCO and Flickr30K datasets.

4.1 Downstream Tasks

Image-Text Retrieval consists of two tasks: (1) image as query and text as targets (TR); (2) text as query and images as targets (IR). The pre-trained model is evaluated on Flickr30K [plummer2015flickr30k] and COCO [lin2014microsoft] under both fine-tuning and zero-shot settings. In the fine-tuning setting, the pre-trained model is fine-tuned on the training data and evaluated on the validation/test data. In the zero-shot setting, the pre-trained model is directly evaluated on the test data without any further training. In particular, for zero-shot retrieval on Flickr30K, we follow the procedure proposed in [li2021align] (zero-shot evaluation on Flickr30K with the model fine-tuned on MSCOCO).

Visual Question Answering (VQA) [goyal2017making] aims to predict the answer given an image and a question (in text form), which requires an understanding of vision, language and commonsense knowledge. We treat this task as a generation problem, following the same setting as [li2021align]. Specifically, an answer decoder is fine-tuned to generate answers from the 3,192 candidates.

Visual Entailment (SNLI-VE) [xie2019visual] predicts whether a given image semantically entails a given text, a three-class classification problem: the relationship between a given image-text pair can be entailment, neutral, or contradictory. Compared with VQA, this task requires more fine-grained reasoning.

Visual Reasoning (NLVR) [suhr2018corpus] determines whether a natural language caption is true about a pair of photographs. We evaluate our model on the NLVR dataset, which contains 107,292 examples of human-written English sentences paired with web photographs. Since this task takes one text and two images as input, we extend our model following [li2021align].

4.1.1 Implementation Details

All of our experiments were performed on NVIDIA A100 GPUs. We adopt ViT-B/16 [dosovitskiy2020image] as our vision encoder. The text encoder uses BERT-base with 123.7M parameters. We set the queue size to , the codebook size to , and the moving average momentum to . For the pre-training stage, the model is trained for 30 epochs with a batch size of 512. We use the mini-batch AdamW optimizer [loshchilov2017decoupled] with a weight decay of 0.02. The learning rate is initialized at , warmed up to after 1,000 iterations, and then decreased to with a cosine decay strategy. For data augmentation, we randomly crop each image and resize it to 256×256, and apply RandAugment [cubuk2020randaugment]. During fine-tuning, the image resolution is increased to 384×384 and the positional encoding is interpolated according to the number of image patches.

4.2 Evaluation on Image-Text Retrieval

For the image-text retrieval tasks, we evaluate under two scenarios, “zero-shot” retrieval and “after-finetuning” retrieval, following the settings in [li2021align, chen2020uniter, li2020oscar]. We compare with both early-fusion methods such as [chen2020uniter, li2020oscar, kim2021vilt] and late-fusion methods such as [radford2021learning, jia2021scaling]. ALBEF [li2021align] is a hybrid approach that also performs feature alignment along with fusion. Results in Tables 1 and 2 show consistent improvements of our approach over the prior state of the art.

“Zero-shot”: As shown in Table 1, CODIS outperforms existing baselines by a clear margin across the two datasets, for both image and text retrieval tasks, especially at R@1. Compared to the best-performing early-fusion approach [chen2020uniter], we obtain a margin of / for TR/IR in terms of R@1 on Flickr30K. When compared to the strongest late-fusion approach [jia2021scaling], there is a rise of / for TR/IR in R@1 on Flickr30K and a / increase for TR/IR in R@1 on MSCOCO, despite the fact that ALIGN [jia2021scaling] uses 1.8B image-text pairs in training (approx. 360× more than our model). Our approach also outperforms ALBEF 4M [li2021align] by a clear margin of 1.2%/1.3% in R@1 for TR/IR on Flickr30K and 2.5%/2.0% in R@1 for TR/IR on MSCOCO, revealing that our model further benefits from codebook representation learning.

“After-finetuning”: This task showcases the ability of V&L pretraining via transfer learning. For small datasets such as Flickr30K, the performance gap tends to shrink as the models converge. However, our approach still achieves the best results on most metrics, with the largest margins at R@1, especially on MSCOCO. Compared against the closest-performing method, ALBEF [li2021align], CODIS obtains an improvement of for TR/IR in R@1 on MSCOCO, a gap similar to that in the zero-shot setting, providing evidence of the effectiveness of CODIS for transfer learning.

Method                        VQA                  NLVR                 SNLI-VE
                              test-dev  test-std   dev       test-P     val       test
VisualBERT [li2019visualbert] 70.80     71.00      67.40     67.00      -         -
VL-BERT [lu2019vilbert]       71.16     -          -         -          -         -
LXMERT [tan2019lxmert]        72.42     72.54      74.90     74.50      -         -
12-in-1 [lu202012]            73.15     -          -         78.87      -         76.95
UNITER [chen2020uniter]       72.70     72.91      77.18     77.85      78.59     78.28
VL-BART/T5 [cho2021unifying]  -         71.3       -         73.6       -         -
ViLT [kim2021vilt]            70.94     -          75.24     76.21      -         -
OSCAR [li2020oscar]           73.16     73.44      78.07     78.36      -         -
VILLA [gan2020large]          73.59     73.67      78.39     79.30      79.47     79.03
ALBEF 4M [li2021align]        74.54     74.70      80.24     80.50      80.14     80.30
Ours                          74.86     74.97      80.50     80.84      80.47     80.40

Table 3: Comparison with a variety of state-of-the-art methods on downstream vision-language tasks: VQA, NLVR, SNLI-VE.

4.3 Evaluation on VQA, NLVR and VE

Following previous approaches [chen2020uniter, li2021align], we further report the performance of CODIS on other vision-language tasks: VQA, NLVR and VE. It is worth noting that some results are not directly comparable, as [chen2020uniter] additionally uses out-of-domain data, [li2020oscar] leverages additional object tags, and [gan2020large] uses adversarial data augmentation. Nevertheless, we observe consistent improvements of our method on all tasks across the different datasets in Table 3.

Objective functions                Flickr30K (1K)                           MSCOCO (5K)
                                   Text Retrieval       Image Retrieval     Text Retrieval      Image Retrieval
                                   R@1   R@5   R@10     R@1   R@5   R@10    R@1   R@5   R@10    R@1   R@5   R@10
a: MLM+ITM+ITC (cross align)       84.90 97.20 99.00    68.18 88.58 93.02   68.60 89.50 94.70   50.10 76.40 84.50
b: MLM+ITM+ITC (intra + cross)     85.80 96.80 98.10    69.70 89.60 93.48   69.86 89.48 94.42   50.52 77.02 85.17
a + codebook (teacher feature)     86.00 97.00 98.20    70.18 90.66 94.44   70.74 89.54 94.88   51.39 77.86 85.60
b + codebook (student feature)     86.30 96.90 98.30    70.34 90.00 93.84   71.12 89.62 94.78   51.40 77.42 85.53
b + codebook (teacher feature)     86.70 97.30 98.70    71.40 90.82 94.62   71.10 90.60 95.10   52.10 78.00 85.90
Table 4: Performance comparison of zero-shot image-text retrieval on Flickr30K and COCO datasets for ablation study.
Figure 4: Grad-CAM visualization of the cross-attention maps corresponding to individual words.

4.4 Ablation Study

In this section, we conduct ablation studies on different variants of CODIS. To clearly understand the effect of each component, we perform the comparisons under the zero-shot setting without any finetuning. Note that the setting here for Flickr30K differs from the one in Section 4.2, as the latter reports numbers based on the model fine-tuned on MSCOCO (5K). Refer to [chen2020uniter] for more details.

Results are summarized in Table 4. Removing the effect of the codebook, we provide two baselines that perform alignment at the instance level: cross-modal alignment only and intra + cross alignment. The former is an equivalent of ALBEF [li2021align], as both consider only alignment across modalities. By involving intra-modal alignment, i.e., enhancing the unimodal representations, performance consistently increases on all R@1 TR/IR metrics (+0.9%/+1.52% in R@1 for TR/IR on Flickr30K and +1.26%/+0.42% in R@1 for TR/IR on MSCOCO).

We observe a consistent improvement over the two baselines when the codebook is considered. Here, we provide three variants of the CODIS design. The 1st and 3rd rows compare the effect of intra-modal alignment, whereas the 2nd and 3rd rows study the effect of using student versus teacher features for computing the codebook loss. This experiment also supports the validity of combining teacher-student distillation with codebook representation learning. Combining the two contributions, CODIS improves the first baseline by a clear margin of absolute R@1 for TR/IR on Flickr30K and in R@1 for TR/IR on MSCOCO.

4.5 Cross-attention visualization

We visualize the cross-attention maps using Grad-CAM [selvaraju2017grad], following [li2021align], to provide a qualitative assessment of CODIS. Figure 4 shows that CODIS is able to associate language with “regions of interest” by attending to meaningful objects and locations, visually reflecting the quality of our model's multimodal alignment. For example, in the first row of the figure, the model attends to all the men when the word “person” is given, while for words such as “tricks” and “takes” the model performs surprisingly well, “focusing” exclusively on the related persons. In the second example, we choose a scene with multiple correspondences (e.g., trees and a sunny day). The model allocates more attention to the trees closest to the camera and can differentiate trees from grass. It is interesting to observe that the model switches its “attention” from the upper body of the giraffe to its feet when the word changes from “giraffe” to “walking”, demonstrating the model's capability of understanding the semantic relations between image and text.

5 Conclusion and Future Work

Vision and language pretraining is attracting growing attention from the computer vision community and has exhibited great potential across a diversity of vision-language downstream tasks. One of the keys to its success is improving multimodal alignment. In this paper, we proposed multimodal alignment using a representation codebook, which acts as a medium between the modalities. We also made a connection between self-supervised learning and V&L pretraining by generalizing teacher-student distillation to the multimodal setting under the V&L framework. Our work is a step toward more principled multimodal alignment, and we hope it inspires more work in this direction.


The authors would like to thank Chenyang Tao for helpful comments on CODIS experiments.