
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

by   Licheng Yu, et al.

We introduce CommerceMM, a multimodal model capable of providing a diverse and granular understanding of the commerce topics associated with a given piece of content (image, text, image+text), and capable of generalizing to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc. We follow the pre-training + fine-tuning training regime and present 5 effective pre-training tasks on image-text pairs. To embrace more common and diverse commerce data with text-to-multimodal, image-to-multimodal, and multimodal-to-multimodal mapping, we propose another 9 novel cross-modal and cross-pair retrieval tasks, called Omni-Retrieval pre-training. The pre-training is conducted in an efficient manner with only two forward/backward updates for the combined 14 tasks. Extensive experiments and analysis show the effectiveness of each task. When combining all pre-training tasks, our model achieves state-of-the-art performance on 7 commerce-related downstream tasks after fine-tuning. Additionally, we propose a novel approach of modality randomization to dynamically adjust our model under different efficiency constraints.


1. Introduction

Figure 1. Example cross-modal and cross-pair data. Users can use a text query to search for products. Some users tag the relevant products when uploading their multimodal media. On a product page, there can be multiple views of the product. While these media are of different types (text, multimodal, image), they are all linked to the same product.

At Facebook, nearly every post related to commerce is multimodal. For example, a Marketplace post is composed of one or several views of a product together with its product description; a Shop product listing is composed of the product images and detailed specifics describing the product, e.g., title, attribute, size, material; and influencers upload their fashion posts to Instagram with captions and hashtags. Since the announcement of Facebook/Instagram Shops on Facebook Business, usage of our commerce platforms has grown rapidly. For example, users perform text query searches on Marketplace every second looking for specific products, they tag products (e.g., t-shirt, necklace) while uploading a post, and they trigger visual search to look for similar products appearing in other images on the platform. The large-scale commerce-related data and the variety of use cases motivate us to build a commerce-specific multimodal representation for the post.

Recently, vision-and-language representation learning has become an increasingly popular research topic. This trend has also motivated people to study commerce-specific pre-training (Gao et al., 2020; Zhuge et al., 2021; Zhu et al., 2021; Dong et al., 2021). In those works, the authors pre-train a transformer-based (Vaswani et al., 2017) model on commerce image-text pairs, then fine-tune it on image-text retrieval, image captioning, category recognition, etc. However, most existing works were trained on medium-scale image-text pairs and evaluated on limited academic tasks. This raises a question: can Facebook learn a more generalized multimodal representation for various practical commerce-related applications?

In this spirit, we introduce Commerce MultiModal Representation (CommerceMM), a large-scale pre-trained model for joint multimodal commerce embedding at Facebook. We scale up the pre-training data from less than 1 million to hundreds of millions. Our model is composed of an image encoder, a text encoder (transformer-based), and a multimodal fusion encoder (transformer-based), as in Fig. 2. We then propose two sets of pre-training tasks for the model and evaluate on a wide range of 7 commerce-related tasks.

The first set of pre-training tasks consists of 5 effective ones on the image-text pairs, including Masked Language Modeling (MLM), Masked Image Modeling KL-Divergence (MIM-kl), Masked Image Modeling Feature Regression (MIM-fr), Image-Text Contrastive Learning (ITC), and Image-Text Matching (ITM). While previous works (Kim et al., 2021; Dou et al., 2021) show MIM is not helpful for multimodal pre-training and typically MLM is the most effective task (Li et al., 2021; Chen et al., 2020), our proposed MIM tasks are essential. The key differences are: (i) we use a larger mask ratio of 50% instead of the 15% used in (Chen et al., 2020; Tan and Bansal, 2019; Lu et al., 2019); (ii) we only recover the signal at the image’s [CLS] token instead of reconstructing every masked patch; (iii) the supervision comes from two other views (ClusterFit (Yan et al., 2020) and GrokNet (Bell et al., 2020)) of the intact raw image.

To our knowledge, all existing multimodal pre-training methods take image-text pairs as the default input for representation learning. However, in practice, and specifically in the commerce domain, we have access to more diverse formats of data, e.g., multiple images, texts, and other valuable signals such as queries and clicks around the same product. For example, in Fig. 1(a), users perform a query search to find a related product. A user’s click on a product builds a text-to-multimodal mapping between the query text and the multimodal product page. Fig. 1(b) shows an example of users tagging a product when uploading their own multimodal media consisting of a photo and a caption, which builds a multimodal-to-multimodal mapping. Fig. 1(c) shows a product page with multiple views of the same product, where each view could be a source image to search for the associated product, making an image-to-multimodal mapping. We call such data cross-modal and cross-pair data. The common image-text pre-training tasks like ITC and ITM can no longer be applied, since their input is fixed to be an image-text pair and they can only handle the image-text relation. This motivates us to propose the second set of pre-training tasks: Omni-Retrieval pre-training. We propose to build relations between any pair of modality types among image, text, and multimodal. Specifically, we encapsulate the cross-pair data into the form (source image, source text, target image, target text). We predict three embeddings for the source and the target respectively: a visual embedding, a textual embedding, and a multimodal embedding. Exhaustive matching results in 9 types of pairwise relations, each of which can be learned with contrastive learning, mimicking 9 retrieval tasks in total. With the 9 Omni-Retrieval tasks, our pre-trained model is shown to learn a more discriminative and generalized representation for any alignment space.

We combine the 5 image-text tasks and the 9 Omni-Retrieval tasks during pre-training. While the combined pre-training (14 tasks in total) seems heavy at first glance, it is performed in a simple and efficient manner. During training, there is only 1 forward/backward pass for each set of tasks, and we found the two sets of tasks to be complementary to each other.

Besides pre-training, we conduct a preliminary exploration of another novel idea: modality randomization. We change the text encoder and multimodal fusion encoder layers dynamically during training. At each training step, we randomly assign $K$ and $M$ transformer layers (while keeping the sum of $K$ and $M$ unchanged) to the text and multimodal encoders. This shuffling of roles makes each layer learn from both modalities and allows knowledge sharing. Our experiments show that a lightweight 2nd-stage pre-training with modality randomization further improves multimodal understanding. Additionally, our model can be flexibly adjusted to different architectures to meet different efficiency constraints.

We introduce 7 diverse commerce tasks to evaluate the pre-trained model, including Catalog Categorization, Marketplace Categorization, Image-to-Text Retrieval, Text-to-Image Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, and Image-to-Image Retrieval. We present extensive experiments and analyses to provide useful insights on the effectiveness of each pre-training task and dynamic model architecture. The experiments show that with all 14 pre-training tasks, our model can achieve state-of-the-art performance across all 7 downstream tasks. Furthermore, with dynamic model randomization, we can achieve an even better performance with a smaller model size. Our contributions are summarized as follows: (1) We propose 5 effective image-text pre-training tasks on a large-scale commerce multimodal dataset. (2) We introduce a novel pre-training approach of Omni-Retrieval for the cross-modal and cross-pair data. (3) We validate our approach on 7 diverse tasks with thorough study and insightful analysis.

2. Related Work

Inspired by the success of BERT (Devlin et al., 2018) in natural language processing, there has been a surge of interest in vision-and-language representation learning, by pre-training on large-scale multimodal data with a transformer-based (Vaswani et al., 2017) model architecture, then fine-tuning on various downstream tasks (Sun et al., 2019; Chen et al., 2020; Su et al., 2019; Lu et al., 2019; Li et al., 2019, 2020a, 2020b; Zhou et al., 2020; Li et al., 2021; Wang et al., 2021c; Li et al., 2020c; Zhang et al., 2021; Wang et al., 2021b, a; Tan and Bansal, 2019). While different works use different architectures (e.g. two-stream (Lu et al., 2019; Tan and Bansal, 2019; Yu et al., 2020) vs. single-stream (Li et al., 2019, 2020a; Su et al., 2019; Chen et al., 2020)), features (e.g. regions (Anderson et al., 2018) vs. grids (Huang et al., 2020)), backbones (e.g. ConvNets (Huang et al., 2020) vs. Transformers (Kim et al., 2021)) etc., the shared goal of visual-linguistic pre-training is to exploit large-scale, paired image-and-text corpora (Lin et al., 2014; Krishna et al., 2017; Sharma et al., 2018; Ordonez et al., 2011; Jia et al., 2021; Radford et al., 2021) and obtain models that are implicitly pre-built with the appropriate correspondences to understand the multi-modal data. The level of understanding is benchmarked with multiple multimodal downstream tasks (Zellers et al., 2019; Plummer et al., 2015; Yu et al., 2016; Antol et al., 2015).

Figure 2. CommerceMM Model Architecture with the image-text pre-training and omni-retrieval tasks.

Various tasks have been introduced for multi-modal pre-training. The three most notable are Masked Language Modeling (MLM), Masked Image Modeling (MIM), and Image-Text Matching (ITM), which are the direct counterparts of the BERT objectives. Several other variants of these three tasks have also been explored, such as predicting object tags (Li et al., 2020d; Hu et al., 2021), masked region classification (Lu et al., 2019; Chen et al., 2020), sequential caption generation (Zhou et al., 2020; Wang et al., 2021c), image-text contrastive learning (Li et al., 2021; Jia et al., 2021; Radford et al., 2021), etc. Recent works (Dou et al., 2021; Kim et al., 2021) found MIM may not be essential for pre-training among those tasks. In this work, we found that our proposed MIM is more effective than MLM. Inspired by recent works on self-supervised learning in vision (Xie et al., 2021; He et al., 2021), we propose to mask out a larger proportion of image patches and follow MaskFeat (Wei et al., 2021) to reconstruct other views of the whole image rather than recovering only the masked regions.

Besides, all existing pre-training works are conducted on image-text pairs. We could certainly convert the cross-modal and cross-pair data in Fig. 1 to image-text pairs, but that would constrain the model from learning more general representations in other cross-modal spaces, e.g., multimodal-to-multimodal alignment. In comparison, we propose a novel approach of Omni-Retrieval pre-training to build the relations between any modalities towards learning a universal representation.

3. Model and Pre-training

3.1. Model Overview

The model architecture of CommerceMM is illustrated in Fig. 2(a). It is composed of an image encoder, a text encoder, and a multimodal fusion encoder. Given an image, the image encoder converts the raw pixels into a sequence of visual embeddings, i.e., flattened grid/patch-wise features. On the other side, the text encoder converts the input sentence into a sequence of word tokens and feeds them into the transformer to get textual embeddings. Inside both the visual and textual embeddings, there is a special [CLS] token, which encodes the whole image and sentence representation respectively. We feed both sequences of embeddings into another transformer, the multimodal fusion encoder, where the two modalities learn to interact with each other, building contextualized multimodal embeddings. We still keep the [CLS] tokens of both modalities at the output of the multimodal fusion encoder.

Our image encoder is an off-the-shelf ResNet or ViT-B/16 model that has been trained with weak supervision on billions of Instagram images using hashtags (Mahajan et al., 2018). Our text encoder and multimodal fusion encoder are initialized from XLM-R (Conneau et al., 2019), an unsupervised multi-lingual language model. Specifically, the text encoder inherits its first $K$ layers and the multimodal fusion encoder inherits its remaining $M$ layers, whose sum equals XLM-R’s total number of layers (6, 12, or 24 depending on the chosen model size). We keep $K$ and $M$ unchanged under most settings for a fixed model architecture.

Interestingly, such a design also allows us to dynamically change $K$ and $M$ during training, i.e., each layer shuffles between the text encoder and the multimodal encoder. In Sec. 4.6, we show that modality randomization can further enhance multimodal understanding and make the model adjustable to different efficiency constraints.
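The layer split described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the uniform sampling of the split point, and the lower bounds `min_text` and `min_fusion` are all assumptions, and strings stand in for real transformer layers.

```python
import random

def sample_layer_split(total_layers, rng, min_text=0, min_fusion=1):
    """Sample (K, M) with K + M == total_layers (uniform sampling is an assumption)."""
    k = rng.randint(min_text, total_layers - min_fusion)  # inclusive bounds
    return k, total_layers - k

def split_layers(layers, k):
    """The first k layers act as the text encoder, the rest as the fusion encoder."""
    return layers[:k], layers[k:]

rng = random.Random(0)
layers = [f"xlmr_layer_{i}" for i in range(6)]  # placeholders for a 6-layer stack
for step in range(3):
    k, m = sample_layer_split(len(layers), rng)
    text_layers, fusion_layers = split_layers(layers, k)
    print(f"step {step}: K={k}, M={m}")
```

Because the same stack of layers is re-partitioned at every step, each layer is trained in both roles, and a fixed (K, M) can later be chosen to fit a given latency budget.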

Given an image-text pair, we denote its raw visual inputs as $\mathbf{v}_i$ and its input words as $\mathbf{w}_i$, where the subscript $i$ indicates the $i$-th pair in the dataset. As above, an additional special [CLS] token is inserted into each sequence.

We introduce two sets of pre-training tasks: (i) Image-Text Pre-training, which consists of the 5 tasks MLM, MIM-kl, MIM-fr, ITC, and ITM, as in Fig. 2(a); (ii) Omni-Retrieval Pre-training, which consists of 9 cross-modal and cross-pair retrieval tasks, as in Fig. 2(b). We denote the mask indices as $\mathbf{m}$; they are applied to mask out part of the sequences for self-supervised learning. The details of each task are presented as follows.

3.2. Image-Text Pre-training

Masked Language Modeling (MLM)

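In MLM, we randomly mask out the input words with a probability of 15% and replace the masked ones $\mathbf{w}_{\mathbf{m}}$ with the special token [MASK]. The goal of MLM is to predict these masked words based on their surrounding words $\mathbf{w}_{\setminus\mathbf{m}}$ and the visual context $\mathbf{v}$, by minimizing the negative log-likelihood:

$$\mathcal{L}_{\text{MLM}}(\theta) = -\mathbb{E}_{(\mathbf{w}, \mathbf{v}) \sim D} \log P_\theta(\mathbf{w}_{\mathbf{m}} \mid \mathbf{w}_{\setminus\mathbf{m}}, \mathbf{v}),$$

where $\theta$ denotes the model parameters. Each pair $(\mathbf{w}, \mathbf{v})$ is sampled from the whole training set $D$.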
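As a rough illustration of the masking step and the negative log-likelihood, the following sketch uses plain Python lists in place of tokenized tensors. The helper names `mask_tokens` and `masked_nll` are hypothetical, and the probabilities passed to `masked_nll` stand in for the model's predicted probability of each true masked word.

```python
import math
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with [MASK] independently with probability mask_prob."""
    masked, mask_indices = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            mask_indices.append(i)
        else:
            masked.append(tok)
    return masked, mask_indices

def masked_nll(true_word_probs):
    """Average negative log-likelihood of the true words at the masked positions."""
    return -sum(math.log(p) for p in true_word_probs) / len(true_word_probs)

tokens = "vintage red cotton t-shirt size medium".split()
masked, idx = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
print(masked, idx)
```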

Masked Image Modeling (MIM) Similar to MLM, we also sample and mask the visual inputs, e.g., patches. Previous works (Kim et al., 2021; Dou et al., 2021) found that naive masked patch regression is not helpful in multimodal pre-training. In this work, we disregard the reconstruction of each masked region, and instead recover the holistic image signal at the [CLS] token. We first follow (Xie et al., 2021; He et al., 2021) to use a larger masking ratio of 50% (instead of 15% as in (Chen et al., 2020; Tan and Bansal, 2019; Lu et al., 2019)). The masked patches are replaced with grey pixels. Our supervision is provided by another view of the original intact input image. While MaskFeat (Wei et al., 2021) used Histograms of Oriented Gradients (HOG) as the supervision for visual pre-training, we rely on the more discriminative signals from ClusterFit (Yan et al., 2020) and GrokNet (Bell et al., 2020) to extract two additional views of the raw image. Of the two, ClusterFit (Yan et al., 2020) provides the clustering probability, while GrokNet (Bell et al., 2020) provides the pool5 embedding (the feature output from the 5th Conv block). Correspondingly, we propose two MIM tasks that share the same objective base:

$$\mathcal{L}_{\text{MIM}}(\theta) = \mathbb{E}_{(\mathbf{w}, \mathbf{v}) \sim D} \, f_\theta(\mathbf{v}_{\setminus\mathbf{m}}, \mathbf{w}),$$

where $f_\theta$ is defined as follows.

(1) Masked Image Modeling Feature Regression (MIM-fr) MIM-fr learns to regress the multimodal transformer’s output at the image [CLS] token to the pool5 embedding from GrokNet, i.e., $r(\mathbf{v})$. Specifically, we apply an FC layer to convert its hidden output into a vector $h_\theta(\mathbf{v}_{\text{cls}})$ of the same dimension as GrokNet’s pool5. Then we apply L2 regression on the mean square error between the two: $f_\theta(\mathbf{v}_{\setminus\mathbf{m}}, \mathbf{w}) = \lVert h_\theta(\mathbf{v}_{\text{cls}}) - r(\mathbf{v}) \rVert_2^2$.

(2) Masked Image Modeling KL-Divergence (MIM-kl) MIM-kl uses the soft label of the ClusterFit probability (Yan et al., 2020) as the supervision signal, which is the softmax over the ClusterFit output, forming the distribution over clusters $c(\mathbf{v})$. We project the multimodal transformer’s output at the image [CLS] token to the same distribution space with a softmax, giving $g_\theta(\mathbf{v}_{\text{cls}})$. We aim to distill the intact knowledge from ClusterFit into CommerceMM by minimizing the KL divergence between the two distributions: $f_\theta(\mathbf{v}_{\setminus\mathbf{m}}, \mathbf{w}) = D_{\text{KL}}\big(c(\mathbf{v}) \,\Vert\, g_\theta(\mathbf{v}_{\text{cls}})\big)$.
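The two MIM objectives can be sketched per sample as follows. This is an illustrative approximation on plain Python lists: the function names are hypothetical, the regression loss is averaged over dimensions, and the KL direction (ClusterFit target first, model prediction second) is an assumption in line with standard distillation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mim_fr_loss(projected_cls, groknet_pool5):
    """Mean squared error between the projected [CLS] output and the pool5 target."""
    assert len(projected_cls) == len(groknet_pool5)
    return sum((a - b) ** 2 for a, b in zip(projected_cls, groknet_pool5)) / len(projected_cls)

def mim_kl_loss(cls_logits, clusterfit_probs, eps=1e-12):
    """KL(target || prediction), with ClusterFit soft labels as the target distribution."""
    pred = softmax(cls_logits)
    return sum(t * (math.log(t + eps) - math.log(p + eps))
               for t, p in zip(clusterfit_probs, pred) if t > 0)

# Toy values: a 3-d "pool5" target and a 3-cluster soft label.
print(mim_fr_loss([0.1, 0.4, -0.2], [0.0, 0.5, -0.1]))
print(mim_kl_loss([2.0, 0.5, -1.0], [0.7, 0.2, 0.1]))
```

Both losses are computed only from the [CLS] output of the masked input, so no per-patch reconstruction target is needed.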

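Image-Text Contrastive Learning (ITC) Following (Li et al., 2021), we add an image-text contrastive loss between the visual and textual embeddings right before feeding them into the multimodal fusion module. It aims to align the two modalities into the same space before fusion. Specifically, we project the textual embedding and the visual embedding at [CLS] to normalized lower-dimensional representations via two linear transformations $f$ and $g$. The similarity of the text $\mathbf{w}_i$ and the image $\mathbf{v}_j$ is then measured by the dot product

$$s(\mathbf{w}_i, \mathbf{v}_j) = f(\mathbf{w}_i^{\text{cls}})^\top g(\mathbf{v}_j^{\text{cls}}).$$

We apply contrastive learning to bring the matched image-text pairs closer in the embedding space than the unmatched ones, as follows:

$$\mathcal{L}_{\text{ITC}} = -\sum_i \left[ \log \frac{\exp(s(\mathbf{w}_i, \mathbf{v}_i)/\tau)}{\sum_j \exp(s(\mathbf{w}_i, \mathbf{v}_j)/\tau)} + \log \frac{\exp(s(\mathbf{w}_i, \mathbf{v}_i)/\tau)}{\sum_j \exp(s(\mathbf{w}_j, \mathbf{v}_i)/\tau)} \right],$$

where $\tau$ is a learned temperature parameter.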
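A symmetric in-batch contrastive loss of this kind can be sketched as below. It is a simplified stand-in rather than the paper's code: embeddings are plain lists, the loss is averaged over the batch, and the temperature is a fixed constant (0.07 is an arbitrary illustrative value) instead of a learned parameter.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(text_embs, image_embs):
    """sim[i][j] = dot product between text i and image j."""
    return [[sum(a * b for a, b in zip(t, v)) for v in image_embs] for t in text_embs]

def diag_cross_entropy(sim, temperature):
    """Mean of -log softmax(sim[i] / temperature)[i]; the matched pair sits on the diagonal."""
    total = 0.0
    for i, row in enumerate(sim):
        scaled = [s / temperature for s in row]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_z - scaled[i]
    return total / len(sim)

def itc_loss(text_embs, image_embs, temperature=0.07):
    """Text-to-image plus image-to-text contrastive loss over one batch."""
    sim = similarity_matrix(text_embs, image_embs)
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the image-to-text direction
    return diag_cross_entropy(sim, temperature) + diag_cross_entropy(sim_t, temperature)

texts = [l2_normalize([1.0, 0.2]), l2_normalize([0.1, 1.0])]
images = [l2_normalize([0.9, 0.1]), l2_normalize([0.0, 1.0])]
print(itc_loss(texts, images))
```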

Image-Text Matching (ITM) In ITM, the inputs are a paired sentence and image, and the output is a binary label $y \in \{0, 1\}$ indicating whether the input pair is a match. We extract the hidden output of the [CLS] token at the last layer of the multimodal fusion encoder to represent the fused representation of both modalities, then feed it into an FC layer followed by a sigmoid function to predict a single score between 0 and 1. We denote the output score as $o_\theta(\mathbf{w}, \mathbf{v})$. During pre-training, we sample a positive or negative pair from the dataset at each step. The negative pair is created by replacing the image or text in a paired sample with a randomly selected one from other samples. Following (Li et al., 2021), when ITC is applied, we can sample the hard negative pairs from ITC’s computed similarity matrix $s(\mathbf{w}_i, \mathbf{v}_j)$. Incorporating those hard negatives makes ITM a harder task, which is more beneficial for pre-training (Miech et al., 2021). We apply binary cross entropy for this loss:

$$\mathcal{L}_{\text{ITM}}(\theta) = -\mathbb{E}_{(\mathbf{w}, \mathbf{v}) \sim D} \left[ y \log o_\theta(\mathbf{w}, \mathbf{v}) + (1 - y) \log\big(1 - o_\theta(\mathbf{w}, \mathbf{v})\big) \right].$$
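The sketch below illustrates one plausible way to draw a hard negative from a row of the ITC similarity matrix and to score the result with binary cross entropy. Sampling with probability proportional to the exponentiated similarity is an assumption borrowed from common practice; the text above only states that hard negatives are sampled from the similarity matrix. Function names are hypothetical.

```python
import math
import random

def sample_hard_negative(sim_row, positive_index, rng):
    """Sample a non-matching index, favoring candidates with high similarity."""
    candidates = [j for j in range(len(sim_row)) if j != positive_index]
    m = max(sim_row[j] for j in candidates)
    weights = [math.exp(sim_row[j] - m) for j in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def bce(score, label, eps=1e-12):
    """Binary cross entropy for one sigmoid score in (0, 1) and a 0/1 label."""
    return -(label * math.log(score + eps) + (1 - label) * math.log(1 - score + eps))

rng = random.Random(0)
sim_row = [0.9, 0.8, -0.5, 0.1]  # similarities of text 0 to images 0..3; image 0 is the match
neg = sample_hard_negative(sim_row, positive_index=0, rng=rng)
print("hard negative image index:", neg)
print("loss for a confident correct 'no match':", bce(0.05, 0))
```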

3.3. Cross-Pair Pre-training: Omni Retrieval

As in Fig. 1, besides image-text pairs, there is also a huge amount of cross-modal and cross-pair commerce data. We formulate such data as two pairs, where we denote the source pair as $(\mathbf{w}^{s}, \mathbf{v}^{s})$ and the target pair as $(\mathbf{w}^{t}, \mathbf{v}^{t})$. Note that one of $\mathbf{w}$ and $\mathbf{v}$ (in the source/target pair) could be missing in some cases. For example, the source for a search query is only a single sentence, and the source for visual search is only a single image, while both are linked with some multimodal product pages. We replace the missing image or text with grey pixels or an empty string, and introduce an indicator $\delta$ to tell whether each modality exists. We replicate our model for both the source and target pairs, sharing the same parameters, as in Fig. 2(b). We first feed the source pair to our model; our image encoder, text encoder, and multimodal fusion encoder return three embeddings at their corresponding [CLS] tokens respectively. With 3 simple linear transformations, we get three normalized embeddings $\mathbf{e}^{s}_{\text{img}}$, $\mathbf{e}^{s}_{\text{txt}}$, and $\mathbf{e}^{s}_{\text{mm}}$. Similarly, we can get the image embedding $\mathbf{e}^{t}_{\text{img}}$, text embedding $\mathbf{e}^{t}_{\text{txt}}$, and multimodal embedding $\mathbf{e}^{t}_{\text{mm}}$ for the target pair. If a source pair is linked with a target pair, we assume any existing modality from the source is highly correlated with every existing modality from the target. Thus we compute the similarity score between any pair of source and target embeddings from the text, image, or multimodal perspective as follows:

$$s^{a,b}_{ij} = \big(\mathbf{e}^{s}_{a,i}\big)^\top \mathbf{e}^{t}_{b,j}, \qquad a, b \in \{\text{img}, \text{txt}, \text{mm}\}.$$

In total, there are 9 cross-modal combinations, resulting in 9 similarity matrices within each batch. We define our Omni-Retrieval loss as the sum of the contrastive loss over the 9 similarities:

$$\mathcal{L}_{\text{Omni}} = \sum_{a} \sum_{b} \mathcal{L}_{\text{con}}\big(s^{a,b};\, \delta^{s}_{a}, \delta^{t}_{b}\big),$$

where $a$ and $b$ refer to the modality of image, text, or multimodal, and $\delta^{s}_{a,i}$ (respectively $\delta^{t}_{b,i}$) indicates whether that modality of the $i$-th input pair exists or not, which works like a gate function to turn the contrastive learning on or off for each pair.
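To make the 3 x 3 structure and the gating concrete, here is a small sketch. Several details are assumptions rather than the paper's specification: only the source-to-target direction is computed, candidates whose modality is missing are dropped from the softmax denominator, the temperature is a fixed constant, and all names are hypothetical.

```python
import math
from itertools import product

MODALITIES = ("image", "text", "multimodal")

def gated_contrastive(sim, src_exists, tgt_exists, temperature):
    """Source-to-target contrastive loss, skipping pairs whose modality is missing."""
    rows = [i for i in range(len(sim)) if src_exists[i] and tgt_exists[i]]
    cols = [j for j in range(len(sim)) if tgt_exists[j]]
    loss = 0.0
    for i in rows:
        scaled = [sim[i][j] / temperature for j in cols]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        loss += log_z - sim[i][i] / temperature
    return loss / len(rows) if rows else 0.0

def omni_retrieval_loss(src, tgt, src_exists, tgt_exists, temperature=0.07):
    """src/tgt map each modality name to a list of normalized embeddings, one per pair."""
    total = 0.0
    for a, b in product(MODALITIES, MODALITIES):  # 3 x 3 = 9 retrieval relations
        sim = [[sum(x * y for x, y in zip(s, t)) for t in tgt[b]] for s in src[a]]
        total += gated_contrastive(sim, src_exists[a], tgt_exists[b], temperature)
    return total

# Two linked source/target pairs; the first source is a text-only query (no image).
emb = [[1.0, 0.0], [0.0, 1.0]]
src = {m: emb for m in MODALITIES}
tgt = {m: emb for m in MODALITIES}
src_exists = {"image": [False, True], "text": [True, True], "multimodal": [True, True]}
tgt_exists = {m: [True, True] for m in MODALITIES}
print(omni_retrieval_loss(src, tgt, src_exists, tgt_exists))
```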

4. Experiments

We report experimental results on three model sizes: ResNet50-based 6-layer CommerceMM, ViT-B/16-based 6-layer CommerceMM, and ViT-B/16-based 12-layer CommerceMM. Our ResNet50-based 6-layer CommerceMM (L=6, H=512, A=8) has 172M parameters; the ViT-B/16-based 6-layer CommerceMM has 234M parameters; the ViT-B/16-based 12-layer CommerceMM (L=12, H=768, A=12) has 365M parameters. Here L is the total number of transformer layers (text + multimodal fusion encoders), H is the hidden activation dimension, and A is the number of attention heads. Inside each model, 128M parameters come from XLM-R’s token embeddings. We use MMF (Singh et al., 2020) for the implementation.

During pre-training, we set the learning rate as 5e-5, batch size as 2,048, and update in total 300K steps. We apply another 100K steps for a 2nd-stage pre-training when modality randomization in Sec. 4.6 is applied. In the 5 image-text pre-training tasks, we empirically assign an equal weight of 1.0 to MIM-kl, MIM-fr, ITM and ITC, and 0.5 to MLM. For the 9 Omni-Retrieval tasks, we also assign an equal weight of 1.0 to each of them. While there are in total 14 different tasks, our model is trained efficiently in a round-robin fashion with only two types of forward/backward. At each step, we randomly pick between the set of 5 image-text pre-training tasks or the set of 9 Omni-Retrieval tasks for updating the model parameters. It takes in total 3,840 and 4,608 A100 hours to pre-train our end-to-end 6-layer CommerceMM and 12-layer CommerceMM respectively. We evaluate the pre-trained models on 7 commerce-related tasks.
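The schedule above can be summarized in a short sketch: at each step one of the two task sets is picked at random, and the losses within that set are combined with the stated weights. The loss weights come from the text; the task-set labels, the Omni task names, and the dummy loss values are placeholders.

```python
import random

IMAGE_TEXT_WEIGHTS = {"MLM": 0.5, "MIM-kl": 1.0, "MIM-fr": 1.0, "ITM": 1.0, "ITC": 1.0}
OMNI_WEIGHTS = {f"omni_{i}": 1.0 for i in range(9)}  # hypothetical names for the 9 tasks

def weighted_loss(losses, weights):
    """Weighted sum of the per-task losses computed in a single forward pass."""
    return sum(weights[name] * value for name, value in losses.items())

def pick_task_set(rng):
    """Choose which of the two task sets drives this step's forward/backward pass."""
    return rng.choice(("image_text", "omni"))

rng = random.Random(0)
for step in range(4):
    task_set = pick_task_set(rng)
    weights = IMAGE_TEXT_WEIGHTS if task_set == "image_text" else OMNI_WEIGHTS
    dummy_losses = {name: 1.0 for name in weights}  # stand-ins for real loss values
    print(step, task_set, weighted_loss(dummy_losses, weights))
```

Since all tasks within a set share one forward pass, the cost per step is closer to that of a single task than to the sum of all 14.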

4.1. Pre-training Dataset

Our pre-training dataset consists of two subsets. The first subset is 102M image-text pairs, collected from two sources – Product Catalog Posts (50M) and Marketplace Posts (52M). The text is the concatenation of product title and description. The average number of words of the concatenated texts is 62. Note our texts are multi-lingual, including English, French, German, Spanish, Portuguese and other languages. For each image, we pre-compute its GrokNet embeddings (Bell et al., 2020) and ClusterFit probability (Yan et al., 2020) for the MIM tasks in our image-text pre-training.

The second subset is 50M cross-modal and cross-pair data for the Omni-Retrieval pre-training. The data comes from various sources:

  1. IG and FB Shops product catalog: Each product has product title/description and product image.

  2. IG and FB Shops text search queries with clicked product: The search query text is used as the query to retrieve the clicked products (product title/description, product image).

  3. IG and FB posts where a product is tagged on the post: The post usually contains a post image and post caption text. The post caption text is frequently empty. Such posts are used to retrieve the tagged product (product title/description, product image) given the post image.

4.2. Downstream Tasks

With minimal surgery of CommerceMM, our model can be adapted to new tasks that are not seen during pre-training. We introduce 7 downstream commerce-related tasks. Each of the downstream fine-tuning tasks has its own dataset and annotations.

Catalog Categorization (CC)

We annotated the fine-grained category labels for 2.5M multimodal catalog posts. The labels are tree-structured on shopping-based categories, e.g., “Home/Home Decor/Decorative Accents/Posters, Prints & Paintings”. There are in total 5,168 leaf labels. We focus on classifying the leaf label for each post. One example of our catalog posts is shown in the product page of Fig. 1, consisting of an image, a title and a description (for posts with multiple images, we select the first one to pair with the description, making an image-text pair). We concatenate the title and description as a full sentence. There are on average 99.6 words for each post. We add a 5,168-way classifier on top of the [CLS] token of the multimodal fusion encoder and apply cross entropy loss during fine-tuning. Overall accuracy is reported as the evaluation metric.

| Line | Pre-training Tasks | Meta Avg. | CC | MPC | T2I | I2T | Q2P | I2P | I2Pi |
|---|---|---|---|---|---|---|---|---|---|
| 1 | None | 50.39 | 72.08 | 63.75 | 22.70 | 23.88 | 48.43 | 55.80 | 66.10 |
| 2 | MLM | 52.77 | 73.10 | 67.94 | 24.87 | 25.84 | 51.59 | 60.46 | 65.59 |
| 3 | MIM-kl | 53.59 | 73.26 | 69.04 | 26.31 | 26.91 | 53.89 | 59.18 | 66.54 |
| 4 | MIM-kl + MIM-fr | 54.18 | 73.27 | 69.12 | 27.88 | 28.61 | 54.05 | 59.48 | 66.83 |
| 5 | MLM + MIM-kl + MIM-fr | 54.19 | 73.64 | 69.55 | 26.66 | 26.98 | 53.47 | 61.66 | 67.30 |
| 6 | MLM + MIM-kl + MIM-fr + ITM | 54.62 | 73.64 | 69.45 | 27.82 | 28.33 | 54.63 | 61.63 | 66.87 |
| 7 | MLM + MIM-kl + MIM-fr + ITM + ITC | 57.87 | 73.76 | 69.61 | 39.11 | 40.30 | 55.60 | 60.03 | 66.65 |
| 8 | Omni Retrieval (Omni) | 56.27 | 72.98 | 67.81 | 29.69 | 30.78 | 57.34 | 67.98 | 67.31 |
| 9 | MLM + MIM-kl + MIM-fr + ITM + ITC + Omni | 60.64 | 73.77 | 69.73 | 42.05 | 43.06 | 58.48 | 69.20 | 68.16 |
Table 1. Ablation Study of different pre-training tasks using the fine-tuned performance of Catalog Categorization (CC), Marketplace Categorization (MPC), Text-to-Image Retrieval (T2I), Image-to-Text Retrieval (I2T), Query-to-Product Retrieval (Q2P), Image-to-Product Retrieval (I2P), and Image-to-Product-Image Retrieval (I2Pi). All results are obtained from ResNet50-based CommerceMM (6-layer). For all retrieval tasks, R@1 scores are reported. Meta Avg is the average score of 7 downstream tasks, measuring the overall performance. Dark and light grey colors highlight the top and second best results for each task.

Marketplace Categorization (MPC) Similar to Catalog Categorization, we have also annotated the fine-grained category labels for 2M Marketplace posts. There are in total 1,387 leaf labels for the collected Marketplace posts. Each post consists of an image, a title and a description. We follow the same process of concatenating the title and description. On average, there are 37.3 words in each text. We add a 1,387-way classifier on the multimodal fusion encoder’s [CLS] token for fine-tuning and inference. Overall accuracy is reported as the evaluation metric.

Image-Text Retrieval We collect 1M Catalog Posts for the text-to-image retrieval (T2I) and image-to-text retrieval (I2T) tasks. We split off a 10K subset for evaluation and leave the rest for model training. There are two ways to perform image-text retrieval. One uses the ITM head to predict the matching score for each input image-text pair and ranks the scores of all pairs (Chen et al., 2020; Gao et al., 2020; Zhuge et al., 2021). The other computes the similarity between the image encoder and text encoder outputs and picks the best match without deeper fusion between the two modalities, as in (Jia et al., 2021; Radford et al., 2021; Li et al., 2021). While the first approach may achieve better performance (Miech et al., 2021), its computational cost is huge (it grows quadratically with the size of the retrieval pool), which makes it impractical. Thus we follow the second approach and simply use the image and text encoders of CommerceMM for the retrieval tasks. We apply contrastive learning loss during fine-tuning. Recall@1 is used to measure the retrieval performance.
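For reference, Recall@k over a query-by-candidate similarity matrix can be computed as in the sketch below. It assumes one ground-truth candidate per query, located at the same index as the query; the function name and the toy scores are illustrative.

```python
def recall_at_k(sim, k=1):
    """sim[i][j] is the score of query i against candidate j; candidate i is the match."""
    hits = 0
    for i, row in enumerate(sim):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sim)

# Toy 3 x 3 example: queries 0 and 2 rank their match first, query 1 ranks it second.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.7],
    [0.1, 0.2, 0.8],
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

With separately encoded queries and candidates, each side is embedded once and only dot products are needed at search time, which is why this setup scales better than scoring every pair with the ITM head.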

Figure 3. The model architecture for the Query-to-Product and Image-to-Product retrieval tasks.

Query-to-Product Retrieval (Q2P) We collected 1M text search queries with their clicked products. Each product is associated with a multimodal product page. We split off a 10K subset for evaluation and leave the rest for model training. We consider this problem a text-to-multimodal retrieval task. The model is shown in Fig. 3(a), where we use CommerceMM’s text encoder to encode the input query text and the whole model to encode the candidate multimodal post. Contrastive learning loss is applied during fine-tuning. We use Recall@1 as the evaluation metric.

Image-Product Retrieval We collected 1M post image queries with the tagged products. Each pair consists of a source image and a target product page. We split off a 10K subset for evaluation and set up two tasks for this dataset. The first is image-to-multimodal retrieval in the given data format, which we call image-to-product retrieval (I2P). The model for I2P is shown in Fig. 3(b), where we use the image encoder to encode the input query image and the whole model to encode the multimodal product post. The second is an image-to-image retrieval task, where we only use the target images as the candidates. We call this task image-to-product-image retrieval (I2Pi). For both tasks, we apply contrastive learning loss and use Recall@1 for evaluation.

4.3. Ablation Study on Pre-training

We analyze the effectiveness of different pre-training settings through ablation studies over the 7 downstream tasks. In addition to the standard metrics mentioned above, we also compute the Meta Average score (the average of the results across all tasks) as a global metric. The complete ablation study is listed in Table 1. In these experiments, we use a ResNet50-based 6-layer CommerceMM, i.e., the number of layers of the text encoder and multimodal fusion encoder together equals 6, initialized from XLM-R-small (Conneau et al., 2019). All models are trained in an end-to-end manner.
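As a quick check of how the Meta Average is formed, the snippet below averages the seven per-task scores reported in Table 1 for the baseline (Line 1) and for the full model (Line 9). The helper name `meta_average` is illustrative.

```python
def meta_average(scores):
    """Unweighted mean of the per-task scores."""
    return sum(scores.values()) / len(scores)

# Per-task scores copied from Table 1.
line_1_none = {"CC": 72.08, "MPC": 63.75, "T2I": 22.70, "I2T": 23.88,
               "Q2P": 48.43, "I2P": 55.80, "I2Pi": 66.10}
line_9_all = {"CC": 73.77, "MPC": 69.73, "T2I": 42.05, "I2T": 43.06,
              "Q2P": 58.48, "I2P": 69.20, "I2Pi": 68.16}

print(round(meta_average(line_1_none), 2))  # 50.39, matching the table
print(round(meta_average(line_9_all), 2))   # 60.64, matching the table
```

Note that the average mixes accuracy (CC, MPC) with Recall@1 (the retrieval tasks), so it is a convenient summary rather than a single well-defined metric.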

First, we provide a baseline without any multimodal pre-training in Line 1 (L1). In other words, this model is initialized directly from the off-the-shelf ResNet and XLM-R, which were pre-trained in the vision-only and language-only domains respectively.

Second, we validate the effectiveness of each pre-training task through a thorough ablation. Comparing L2 and L3, our proposed MIM-kl (L3) achieves a clear gain over MLM (L2), raising the Meta Average from 52.77 to 53.59. When applying both MIM-kl and MIM-fr, L4 further improves the performance. This is quite a different observation from (Kim et al., 2021; Dou et al., 2021), where MIM was not shown to be helpful. The difference indicates the effectiveness of our proposed MIM tasks. Interestingly, the combination of MLM, MIM-kl, and MIM-fr in L5 barely outperforms L4 (with only a 0.01 gain on the Meta Average). One possible reason is that the two MIM tasks overshadow the effect of MLM. L6 and L7 add ITM and ITC to the pre-training, both of which further improve the Meta Average. Notably, ITC introduces a significant improvement on the image-text retrieval tasks, i.e., T2I and I2T, because the same task has already been well warmed up during pre-training.

Next, we validate the contribution of the Omni-Retrieval (Omni) pre-training tasks. In L8, we only apply the 9 Omni-Retrieval tasks during pre-training without any help from image-text pre-training. We observe a significant gain on Q2P, I2P, and I2Pi, which benefit from the text-to-multimodal, image-to-multimodal, and image-to-image tasks in Omni. Additionally, Omni also helps on the first 4 image-text tasks (CC, MPC, T2I, I2T), comparing L1 and L8. When combining both image-text and Omni-Retrieval pre-training, our model in L9 achieves the best result on every single task.

Last but not least, we compare the performance on I2P and I2Pi. The two tasks share the same evaluation set, where each image in I2Pi comes from its corresponding product page in I2P. Interestingly, we found that without Omni-Retrieval, I2Pi always performs better than I2P (L1-L7). In comparison, L8 and L9 show that with Omni, I2P’s results are better than I2Pi’s, which aligns with our intuition that a multimodal product page contains more cues than its product image alone. This indicates that Omni helps learn more generalized representations under different alignment spaces.

| Line | Vis. Enc. | (K, M) | Pre-training Tasks | Meta Avg. | CC | MPC | T2I | I2T | Q2P | I2P | I2Pi |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GrokNet Hash | (0, 6) | None | - | 71.26 | 65.26 | - | - | - | - | - |
| 2 | GrokNet Hash | (0, 6) | MLM + MIM-kl + MIM-fr + ITM | - | 73.40 | 69.04 | - | - | - | - | - |
| 3 | ResNet50 | (0, 6) | MLM + MIM-kl + MIM-fr + ITM | - | 73.54 | 69.13 | - | - | - | - | - |
| 4 | ResNet50 | (3, 3) | MLM + MIM-kl + MIM-fr + ITM | 54.62 | 73.64 | 69.45 | 27.82 | 28.33 | 54.63 | 61.63 | 66.87 |
| 5 | ResNet50 | (3, 3) | MLM + MIM-kl + MIM-fr + ITM + ITC + Omni | 60.64 | 73.77 | 69.73 | 42.05 | 43.06 | 58.48 | 69.20 | 68.16 |
| 6 | ViT-B/16 | (3, 3) | MLM + MIM-kl + MIM-fr + ITM + ITC + Omni | 62.69 | 73.78 | 69.80 | 43.84 | 44.29 | 61.43 | 73.41 | 72.31 |
| 7 | ViT-B/16 | (6, 6) | MLM + MIM-kl + MIM-fr + ITM + ITC + Omni | 66.85 | 74.31 | 70.60 | 52.10 | 53.72 | 65.70 | 77.16 | 74.36 |
Table 2. Effect of vision encoder, text encoder, and model size. (K, M) stands for (#text layers, #multimodal layers) inside CommerceMM, both of which are transformers. Dark and light grey colors highlight the top and second best results.
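To make the Meta Avg. column concrete, the minimal Python sketch below recomputes it as the unweighted mean of the seven task scores for lines 4 to 7 of Table 2, following the definition given in Sec. 4.3 (average of the results across all tasks). The function name `meta_average` and the dictionary layout are illustrative choices of this sketch, not part of the paper's code.

```python
# Task scores copied from Table 2, in the order CC, MPC, T2I, I2T, Q2P, I2P, I2Pi.
TABLE2_SCORES = {
    4: [73.64, 69.45, 27.82, 28.33, 54.63, 61.63, 66.87],
    5: [73.77, 69.73, 42.05, 43.06, 58.48, 69.20, 68.16],
    6: [73.78, 69.80, 43.84, 44.29, 61.43, 73.41, 72.31],
    7: [74.31, 70.60, 52.10, 53.72, 65.70, 77.16, 74.36],
}

# Meta Avg. values as reported in Table 2, used only for comparison.
REPORTED_META_AVG = {4: 54.62, 5: 60.64, 6: 62.69, 7: 66.85}


def meta_average(scores):
    """Unweighted mean over all downstream-task scores (hypothetical helper)."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for line, scores in TABLE2_SCORES.items():
        computed = meta_average(scores)
        print(f"L{line}: computed {computed:.2f}, reported {REPORTED_META_AVG[line]:.2f}")
```

Because the mean is unweighted, the retrieval tasks, which have lower absolute scores but larger gains, dominate the change in Meta Avg. between lines 4 and 5.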

4.4. Effectiveness of Vision Encoder, Text Encoder, and Model Size

In Table 2, we first show the results using the off-the-shelf GrokNet (Bell et al., 2020) Hash feature as the image embedding. We feed the text tokens and image embedding directly into the multimodal transformer as in (Chen et al., 2020; Li et al., 2019), i.e., the multimodal model is an early-fusion model without a text encoder. We apply MLM, MIM-kl, MIM-fr and ITM for the multimodal pre-training, each of which is the same as in Sec. 3.2. (We cannot apply image-text contrastive learning or Omni-Retrieval, as there is no text encoder in the early-fusion architecture.) The only difference lies in the masking strategy of MIM, where we randomly shuffle the 0/1 bits of the input Hash for masking. For fine-tuning, we focus on the multimodal categorization tasks of CC and MPC. We observe that even with only 4 pre-training tasks, there is a significant improvement on CC and MPC from L1 to L2, indicating that our pre-training also works well with the fixed GrokNet vision encoder. We launched L2 as our current production model. More details of the deployment and product impact are provided in Sec. 4.8. We then show the advantage of end-to-end training. Comparing L2 and L3, we observe that the end-to-end trained model with ResNet50 already outperforms the one using GrokNet’s Hash (Bell et al., 2020), which comes from a ResNeXt101 model.

Next, we compare models with and without a text encoder. L3 and L4 have the same number of transformer layers, i.e., the same model size, and are pre-trained with the same tasks. We observe that with a 3-layer text encoder, the model in L4 achieves better performance than the early-fusion model in L3 on CC and MPC. Moreover, introducing a text encoder allows us to perform the text-based retrieval tasks, e.g., T2I, I2T, and Q2P. With the Omni-Retrieval pre-training tasks added, the model in L5 further improves the performance on all retrieval tasks compared with L4.

We then compare the effectiveness of the vision encoder. With the same 224x224 input image size and the same transformer architecture, we found that ViT-B/16 (Dosovitskiy et al., 2020) (L6) brings a consistent gain over ResNet50 (L5) on each of the 7 downstream tasks, showing that a better visual encoder benefits the multimodal applications.

We also experiment with scaling up our transformer from 6 layers (L6) to 12 layers (L7). Note that for the 12-layer transformer, we simply assign the first 6 layers of XLM-R to the text encoder and the remaining 6 layers to the multimodal fusion encoder. Comparing L6 and L7, we observe a further improvement on each task with the larger model. We leave the exploration of even larger transformers with more advanced vision encoders to future work.

4.5. Transferability to Academic Dataset

We also evaluate how our pre-trained model performs on an academic dataset, FashionGen (Rostamzadeh et al., 2018). We strictly follow (Zhuge et al., 2021) in constructing its image-text retrieval task. In text-to-image retrieval, the model is required to pick the matched image from 101 images given a text. Of the 101 images, one is positively paired with the text, and the other 100 are randomly selected from the same sub-category as the positive, which increases the difficulty. The same setting applies to image-to-text retrieval. We fine-tune our smallest ResNet50-based 6-layer CommerceMM (L9 in Table 1) on the dataset with contrastive learning. In Table 3, we compare our model with the state-of-the-art commerce-domain pre-trained models (Gao et al., 2020; Zhuge et al., 2021). We found that even our smallest model already outperforms (Gao et al., 2020; Zhuge et al., 2021) by a clear margin, indicating CommerceMM’s superior transferability.
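The 101-candidate protocol and the R1/R5/R10 metrics can be illustrated with a short Python sketch. It assumes that each query comes with a list of 101 similarity scores whose first entry belongs to the positive candidate; `rank_of_positive`, `recall_at_k`, and the synthetic scores are hypothetical and are not the authors' evaluation code.

```python
import random


def rank_of_positive(scores, positive_index=0):
    """1-based rank of the positive candidate; ties are counted against it."""
    positive_score = scores[positive_index]
    better_or_equal = sum(
        1 for i, s in enumerate(scores) if i != positive_index and s >= positive_score
    )
    return 1 + better_or_equal


def recall_at_k(all_scores, k, positive_index=0):
    """Percentage of queries whose positive is ranked within the top k."""
    hits = sum(rank_of_positive(s, positive_index) <= k for s in all_scores)
    return 100.0 * hits / len(all_scores)


if __name__ == "__main__":
    # Synthetic scores: 1 positive + 100 negatives per query (assumed layout).
    rng = random.Random(0)
    queries = []
    for _ in range(200):
        scores = [rng.random() for _ in range(101)]
        scores[0] += 0.5  # toy "model" that tends to favor the positive
        queries.append(scores)
    for k in (1, 5, 10):
        print(f"R{k}: {recall_at_k(queries, k):.1f}")
```

Since the rank of the positive can only stay within the top k as k grows, R1 is never larger than R5, and R5 is never larger than R10, which matches the pattern in Table 3.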

| Model | T2I R1 | T2I R5 | T2I R10 | I2T R1 | I2T R5 | I2T R10 |
|---|---|---|---|---|---|---|
| FashionBERT (Gao et al., 2020) | 26.8 | 46.5 | 55.7 | 24.0 | 46.3 | 52.1 |
| KaleidoBERT (Zhuge et al., 2021) | 33.9 | 60.6 | 68.6 | 28.0 | 60.1 | 68.4 |
| CommerceMM (small) | 39.6 | 61.5 | 72.7 | 41.6 | 64.0 | 72.8 |
Table 3. Image-Text Retrieval on FashionGen (Rostamzadeh et al., 2018).

4.6. Modality Randomization

As described in Sec. 3.1, our model design allows us to dynamically change the text encoder and multimodal fusion encoder by assigning different numbers of layers, K and M, to each. Previous works (Wang et al., 2021b, a; You et al., 2021; Akbari et al., 2021) show that modality-agnostic training can be beneficial to multimodal understanding. Our approach follows the same spirit. At each training step, we randomly assign K and M transformer layers (while keeping the sum of K and M unchanged) to the text encoder and multimodal encoder, so that every layer can share the knowledge from both text and multimodal modalities.
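The per-step layer assignment can be sketched in a few lines of Python, writing K for the number of text-encoder layers and M for the number of multimodal fusion layers. The sketch only covers the bookkeeping, not the actual training. The uniform sampling, the range of K (0 to total - 1, so that at least one fusion layer remains), and the rule that the first K layers go to the text encoder are assumptions of this sketch; `sample_split` and `route_layers` are hypothetical names.

```python
import random

TOTAL_LAYERS = 6  # K + M is kept fixed across training steps


def sample_split(total_layers, rng):
    """Draw (K, M) with K + M == total_layers.

    Assumption: K is uniform over 0..total_layers-1, so M >= 1.
    """
    k = rng.randint(0, total_layers - 1)
    return k, total_layers - k


def route_layers(layers, k):
    """Assign the first k shared layers to the text encoder, the rest to fusion."""
    return layers[:k], layers[k:]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for the shared transformer layers.
    shared_layers = [f"layer_{i}" for i in range(TOTAL_LAYERS)]
    for step in range(3):
        k, m = sample_split(TOTAL_LAYERS, rng)
        text_layers, fusion_layers = route_layers(shared_layers, k)
        print(f"step {step}: (K, M) = ({k}, {m})")
        print("  text encoder:", text_layers)
        print("  fusion encoder:", fusion_layers)
```

Under this scheme the same layer acts as a text-only layer on some steps and as a fusion layer on others, which is one way to read the statement that every layer shares knowledge from both modalities.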

To validate this idea, we set up a lightweight 2nd-stage pre-training for the modality randomization with the 5 image-text pre-training tasks. We then fine-tune our model on CC and MPC, both of which measure the multimodal recognition capability. Fig. 4 compares the fixed-architecture and modality-randomized pre-trained models. Typically the pre-training and fine-tuning models share the same architecture, but we propose to change the model architecture at the fine-tuning stage for potentially better performance. We observe that the modality-randomized pre-training brings better performance than the fixed-architecture pre-training under every architecture setting at fine-tuning. We also found a smaller accuracy variance for the modality-randomized model across different architectures, showing its robustness to the model change.

We also experiment with changing the total number of layers. In Figure 5, we plot the CC and MPC accuracy with K=0 text-encoder layers and different numbers M of fusion layers (each is an early-fusion model without a text encoder). While it is no surprise that the deeper model brings better accuracy, we found that our small models also perform well, without a notable performance drop compared with using the full set of layers. Note that our 2-layer model achieves 73.10 on CC and 67.88 on MPC, which are already better than the 6-layer model without pre-training in L1 of Table 1. Thus, with modality randomization, we can flexibly adjust our model architecture to satisfy different efficiency constraints in practice.

Our modality randomization can be further explored by incorporating it into the proposed pre-training in a single stage. We leave this promising idea to future work.

Figure 4. Effect of (K, M) from a fixed-architecture (3, 3) pre-trained model and modality-randomized pre-trained model. (K, M) stands for (#text layers, #multimodal layers).
Figure 5. Effect of total number of layers from a modality-randomized model evaluated on CC and MPC after fine-tuning. (K, M) stands for (#text layers, #multimodal layers). K=0 means early-fusion model without text encoder.

4.7. Visualization

We visualize the embeddings from the image encoder, text encoder, and multimodal fusion encoder respectively in Fig. 6. Specifically, we feed 1K multimodal catalog posts from the 10 most popular categories into CommerceMM. t-SNE is applied for the visualization, and the colors correspond to the annotated categories. We compare the results without pre-training (initialized from ResNet+XLM-R), with image-text pre-training using MLM, MIM-kl, and MIM-fr, and with the full 14-task pre-training. We observe that with additional retrieval tasks in pre-training, e.g., ITC and Omni, the embeddings of the same class are clustered more tightly. This indicates that our proposed pre-training tasks help learn a more discriminative representation for each modality.

We also visualize the text-to-image attention in Fig. 7. Comparing the models pre-trained with the 5 image-text tasks and with all tasks (including Omni-Retrieval pre-training), we observe that the cross-modal attention from the full pre-training better attends to the regions referred to by the key words, e.g., “dress” and “earrings”.

Figure 6. T-SNE of different pre-trained models’ embeddings.
Figure 7. Visualization of text-to-image attention from different pre-trained models.

4.8. Deployment and Product Impacts

An early version of the CommerceMM model, i.e., L2 in Table 2, has been deployed at Facebook. We pre-trained the GrokNet (Bell et al., 2020)-based 6-layer model using a subset of the tasks (MLM, MIM-kl, MIM-fr and ITM) and then fine-tuned it on three downstream tasks. We confirmed the benefits of this framework by running A/B tests on the Catalog Categorization (CC), Marketplace Categorization (MPC) and Catalog Attributes applications.

Catalog Category Filters. Users on Shops have access to category filters, which let them search for products of a specific category within a shop. With our newly launched model, we have increased the number of shops with category filters by over 4 times.

Catalog Attributes Filters. Attributes are fine-grained characteristics of objects, such as color, pattern, material, etc. We ran A/B tests to confirm the improvement of attribute prediction on IG and FB Shops for three specific product attributes: color, gender and material. This model enabled the launch of these three attribute filters on 52.7% of all Shops on the platform, thanks to significant precision and coverage improvements over the baseline production models.

After inference, all these category and attributes predictions are stored in a distributed key-value store, which gets consumed by multiple product groups across the company to improve various ranking and recommendation products.

This early deployed version provides substantial evidence that the pre-training is helpful for downstream tasks. The end-to-end CommerceMM will be deployed in the same way, with the simplification of removing the service call to GrokNet Hash (Bell et al., 2020), i.e., the inference will be more efficient.

5. Conclusion

We introduce CommerceMM, a large-scale commerce multimodal model at Facebook. We present 5 effective image-text pre-training tasks and propose a novel set of Omni-Retrieval tasks on cross-modal and cross-pair data. Pre-trained on large-scale, diverse, multi-lingual multimodal data, CommerceMM outperforms state-of-the-art models across 7 downstream tasks by a large margin. An early version of the model has been deployed at Facebook with significant product impact.

We thank our colleagues Wenliang Gao and Maolong Tang for the in-depth discussion and feedback; and Yuyu Zhu, Xueting Yan, Kartikay Khandelwal, and Yang Bai for supporting an early version of this pre-training in another domain. We also thank Amanpreet Singh, Vedanuj Goswami, Sasha Sheng and Ronghang Hu for both development and deployment support of MMF.


  • H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y. Cui, and B. Gong (2021) Vatt: transformers for multimodal self-supervised learning from raw video, audio and text. NeurIPS. Cited by: §4.6.
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Cited by: §2.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) Vqa: visual question answering. In ICCV, Cited by: §2.
  • S. Bell, Y. Liu, S. Alsheikh, Y. Tang, E. Pizzi, M. Henning, K. Singh, O. Parkhi, and F. Borisyuk (2020) Groknet: unified computer vision model trunk and embeddings for commerce. In KDD, Cited by: §1, §3.2, §4.1, §4.4, §4.8, §4.8.
  • Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) Uniter: learning universal image-text representations. In ECCV, Cited by: §1, §2, §2, §3.2, §4.2, §4.4.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116. Cited by: §3.1, §4.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §2.
  • X. Dong, X. Zhan, Y. Wu, Y. Wei, X. Wei, M. Lu, and X. Liang (2021) M5product: a multi-modal pretraining benchmark for e-commercial product downstream tasks. arXiv preprint arXiv:2109.04275. Cited by: §1.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §4.4.
  • Z. Dou, Y. Xu, Z. Gan, J. Wang, S. Wang, L. Wang, C. Zhu, Z. Liu, M. Zeng, et al. (2021) An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387. Cited by: §1, §2, §3.2, §4.3.
  • D. Gao, L. Jin, B. Chen, M. Qiu, P. Li, Y. Wei, Y. Hu, and H. Wang (2020) Fashionbert: text and image matching with adaptive loss for cross-modal retrieval. In SIGIR, Cited by: §1, §4.2, §4.5, Table 3.
  • K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2021) Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377. Cited by: §2, §3.2.
  • X. Hu, X. Yin, K. Lin, L. Wang, L. Zhang, J. Gao, and Z. Liu (2021) Vivo: surpassing human performance in novel object captioning with visual vocabulary pre-training. In AAAI, Cited by: §2.
  • Z. Huang, Z. Zeng, B. Liu, D. Fu, and J. Fu (2020) Pixel-bert: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849. Cited by: §2.
  • C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. Le, Y. Sung, Z. Li, and T. Duerig (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, Cited by: §2, §2, §4.2.
  • W. Kim, B. Son, and I. Kim (2021) Vilt: vision-and-language transformer without convolution or region supervision. In ICML, Cited by: §1, §2, §2, §3.2, §4.3.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV. Cited by: §2.
  • G. Li, N. Duan, Y. Fang, M. Gong, and D. Jiang (2020a) Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. In AAAI, Cited by: §2.
  • J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi (2021) Align before fuse: vision and language representation learning with momentum distillation. NeurIPS. Cited by: §1, §2, §2, §3.2, §3.2, §4.2.
  • L. Li, Y. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu (2020b) Hero: hierarchical encoder for video+ language omni-representation pre-training. In EMNLP, Cited by: §2.
  • L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §2, §4.4.
  • W. Li, C. Gao, G. Niu, X. Xiao, H. Liu, J. Liu, H. Wu, and H. Wang (2020c) Unimo: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409. Cited by: §2.
  • X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. (2020d) Oscar: object-semantics aligned pre-training for vision-language tasks. In ECCV, Cited by: §2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §2.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS. Cited by: §1, §2, §2, §3.2.
  • D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. Van Der Maaten (2018) Exploring the limits of weakly supervised pretraining. In ECCV, Cited by: §3.1.
  • A. Miech, J. Alayrac, I. Laptev, J. Sivic, and A. Zisserman (2021) Thinking fast and slow: efficient text-to-visual retrieval with transformers. In CVPR, Cited by: §3.2, §4.2.
  • V. Ordonez, G. Kulkarni, and T. Berg (2011) Im2text: describing images using 1 million captioned photographs. NeurIPS. Cited by: §2.
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, Cited by: §2.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In ICML, Cited by: §2, §2, §4.2.
  • N. Rostamzadeh, S. Hosseini, T. Boquet, W. Stokowiec, Y. Zhang, C. Jauvin, and C. Pal (2018) Fashion-gen: the generative fashion dataset and challenge. arXiv preprint arXiv:1806.08317. Cited by: §4.5, Table 3.
  • P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, Cited by: §2.
  • A. Singh, V. Goswami, V. Natarajan, Y. Jiang, X. Chen, M. Shah, M. Rohrbach, D. Batra, and D. Parikh (2020) MMF: a multimodal framework for vision and language research. Cited by: §4.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §2.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In ICCV, Cited by: §2.
  • H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. In EMNLP, Cited by: §1, §2, §3.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. NeurIPS. Cited by: §1, §2.
  • J. Wang, X. Hu, Z. Gan, Z. Yang, X. Dai, Z. Liu, Y. Lu, and L. Wang (2021a) UFO: a unified transformer for vision-language representation learning. arXiv preprint arXiv:2111.10023. Cited by: §2, §4.6.
  • W. Wang, H. Bao, L. Dong, and F. Wei (2021b) VLMo: unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358. Cited by: §2, §4.6.
  • Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao (2021c) Simvlm: simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904. Cited by: §2, §2.
  • C. Wei, H. Fan, S. Xie, C. Wu, A. Yuille, and C. Feichtenhofer (2021) Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133. Cited by: §2, §3.2.
  • Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu (2021) Simmim: a simple framework for masked image modeling. arXiv preprint arXiv:2111.09886. Cited by: §2, §3.2.
  • X. Yan, I. Misra, A. Gupta, D. Ghadiyaram, and D. Mahajan (2020) Clusterfit: improving generalization of visual representations. In CVPR, Cited by: §1, §3.2, §3.2, §4.1.
  • H. You, L. Zhou, B. Xiao, N. C. Codella, Y. Cheng, R. Xu, S. Chang, and L. Yuan (2021) MA-clip: towards modality-agnostic contrastive language-image pre-training. Cited by: §4.6.
  • F. Yu, J. Tang, W. Yin, Y. Sun, H. Tian, H. Wu, and H. Wang (2020) Ernie-vil: knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934. Cited by: §2.
  • L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg (2016) Modeling context in referring expressions. In ECCV, Cited by: §2.
  • R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi (2019) From recognition to cognition: visual commonsense reasoning. In CVPR, Cited by: §2.
  • P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao (2021) Vinvl: revisiting visual representations in vision-language models. In CVPR, Cited by: §2.
  • L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao (2020) Unified vision-language pre-training for image captioning and vqa. In AAAI, Cited by: §2, §2.
  • Y. Zhu, H. Zhao, W. Zhang, G. Ye, H. Chen, N. Zhang, and H. Chen (2021) Knowledge perceived multi-modal pretraining in e-commerce. In ACM-MM, Cited by: §1.
  • M. Zhuge, D. Gao, D. Fan, L. Jin, B. Chen, H. Zhou, M. Qiu, and L. Shao (2021) Kaleido-bert: vision-language pre-training on fashion domain. In CVPR, Cited by: §1, §4.2, §4.5, Table 3.