Self-supervised learning has driven rapid progress in computer vision and natural language processing, as well as in multi-modal representation learning. Given the prevalence of online shopping in daily life, multi-modal pre-training on E-commerce products has received increasing attention and is shaping next-generation technology for several downstream tasks (e.g., multi-modal retrieval, multi-modal classification and clustering). Recent progress in multi-modal pre-training, from both theoretical verify and practical singlmodal1 ; singlmodal2 perspectives, shows that large-scale samples with diverse modalities can effectively enhance the discrimination of the generated features and thus improve performance on vision-language tasks.
A large-scale, diverse and cleanly annotated dataset is instrumental for improving the performance of a multi-modal pre-training network. However, research is restrained for the following reasons: 1) Existing datasets are comparatively small, and their modalities and categories generally lack diversity. The existing product datasets provide only a limited amount of training data and few categories, and some of them do not include category information at all. For instance, the largest public generic dataset, Conceptual Caption CC , provides no category information and includes only text and images. Meanwhile, in several product datasets (e.g. RPC checkout RPC (200 categories), Dress Retrieval dress (no categories) and Product1M Product1M (458 categories)), the number of categories is too limited to verify the performance of downstream tasks across a broad range of products. 2) Leveraging unlabeled data from more diverse domains and modalities beyond images and text has barely been explored. Models trained on the available generic multi-modal datasets COCO ; ImageNet ; Labelme are usually sub-optimal for E-commerce products because of the domain gap. Although some E-commerce product datasets exist, they are specialized to the fashion domain dress ; Deepfashion ; Deepfashion2 . As a result, such models adapt poorly when more product categories are added to the inventory, and performance degrades quickly.
More importantly, the research community currently focuses on the two modalities of text and image for multi-modal pre-training and ignores the complementary information from other modalities, which limits the performance improvements. How to perform multi-modal training with more diverse modalities beyond image and text has barely been explored and remains an open question. For example, the table gives detailed property and attribute information, e.g. brand, materials, attributes and scenarios. Product videos and audio can illustrate different viewpoints, scales, affordances, selling points, characteristics and application scenarios that cannot be revealed by images alone.
To address insufficient scale and limited modality diversity and to bridge the gap with real-world scenarios, we present M5Product, a very large-scale E-commerce multi-modal product dataset, which is the largest and most diverse multi-modal product dataset so far. Our M5Product dataset contains more than 6 million multi-modal samples and 6,232 categories, which is about twice the size of the largest public image-text dataset, Conceptual Caption CC , and has more complex and diverse modalities than existing datasets. Figure 1 shows the five modalities (image, caption, video, audio and table) in our dataset. Each sample in M5Product is crawled from E-tail websites. We annotate all samples based on the image contents and text descriptions. To facilitate fine-grained recognition, we annotate fine-grained categories for a subset of M5Product that contains one million cosmetic samples. A comprehensive comparison between our M5Product dataset and other widely used multi-modal pre-training datasets is shown in Table 2.
To investigate how the number of modalities influences the performance of a self-supervised pre-training model, we propose a generic framework that takes data of five modalities as input, as shown in Figure 2. In more detail, we first propose a multi-modal five-stream pre-training model named Multi-Model Transformer (M5-MMT) for several downstream tasks and compare it with several vision-language models CLIP ; ViLBERT , which serve as baselines and use only the image and text modalities. To verify the advantages of using more diverse data for training, we also compare the retrieval performance of our M5-MMT using different numbers of modalities. The overview of our benchmark is shown in Figure 2. The results show the superiority of M5-MMT over the baselines that adopt only part of the modalities. Additionally, we train and evaluate the models mentioned above on four real-world downstream tasks for E-commerce products and give several observations by analyzing the results. These downstream tasks include multi-modal retrieval, fine-grained retrieval, multi-modal classification and clustering. The observations from our extensive experiments can be summarized as follows:
For multi-modal pre-training models on the E-commerce domain, dataset scale and diversity are relatively important for the downstream tasks.
Our M5-MMT with hybrid-streams for learning semantic alignment from five modalities on M5Product shows benefits over two-modalities models.
In the large-scale and complex scenarios, the modal complementary gain between different modalities is more obvious.
Owing to cross-modal contrastive learning, which improves the discrimination of the learned features, our M5-MMT shows superior performance over the other baseline methods.
2 Related Work
Multi-modal pre-training dataset. Most multi-modal pre-training datasets are collected from social websites (e.g. Twitter and Facebook) with just two modalities for specified tasks. These datasets can be divided into four categories according to their modality composition, i.e., audio/text, video/text, image/text and others. LJ Speech ljspeech17 and SQuAD SquAD are the classical audio/text datasets for voice synthesis and audio QA. Similarly, most video/text datasets TVQA ; MovieQA ; TGIF ; AVSD ; Youcook2 ; VATEX ; MSRVTT ; HowTo100M are mainly for video QA and have a limited number of samples. Currently, image/text datasets CC ; SBU ; VG ; COCO ; Flickr ; NLVR2 ; VQA ; RPC ; twitter100k ; INRIA ; nuswide ; OpenImage are widely used for pre-training multi-modal models. Conceptual Caption CC , with more than three million image-text pairs, is the most widely used pre-training dataset. MS COCO COCO , Flickr30K Flickr , INRIA-Websearch INRIA and NUS-WIDE nuswide , which come with standard annotations, are often used in multi-modal and cross-modal retrieval tasks. Other datasets include CMU-MOSEI cmumosei and XMedia XMedia , where CMU-MOSEI mainly focuses on emotional analysis and XMedia is used for cross-modal retrieval.
Recently, with the rapid development of E-tailing, several product-oriented multi-modal benchmarks have been proposed to address key challenges such as similar-product recommendation and visual or text product search. Dress Retrieval dress , RPC checkout RPC and Product1M Product1M are typical E-commerce multi-modal datasets. Dress Retrieval has 20,200 samples in the clothing category, RPC checkout offers 30,000 samples of small retail goods with simple backgrounds, and Product1M provides 1.18 million samples with 458 cosmetics classes. Compared with these three datasets, our M5Product is not only larger in terms of categories and data scale, but also contains a more diverse set of modalities. Detailed comparisons with other multi-modal pre-training datasets are given in Table 2.
Multi-modal pre-training for E-commerce products. In the last few years, many vision-language pre-trained models have been explored for visual-text multi-modal learning. They can be coarsely divided into two categories: 1) Single-stream models, whose transformer layers operate jointly on the concatenation of visual and text inputs, e.g., VL-BERT VLbert , Image-BERT ImageBert , VideoBERT VideoBert and HERO HERO . 2) Dual-stream models, whose image and text inputs are not concatenated. ViLBERT ViLBERT , LXMERT LXMERT and CLIP CLIP are the most typical methods.
There are also several studies on fashion-based tasks, such as FashionBERT Fashionbert , MAAF MAAF , Kaleido-BERT Kaleido-BERT and CAPTURE Product1M . FashionBERT Fashionbert is the first study related to E-commerce in the fashion domain. MAAF MAAF derives a modality-agnostic attention fusion strategy to address the undifferentiated text and image query task. Kaleido-BERT Kaleido-BERT introduces a novel Kaleido strategy for fashion cross-modality representations from transformers. CAPTURE Product1M proposes a novel contrastive learning framework via a hybrid-stream transformer for multi-product retrieval. All existing studies in E-commerce scenarios focus solely on the image and text modalities, and no benchmark method to date handles as many modalities as ours. Our proposed benchmark fills this gap by fully using the diverse modalities of the M5Product dataset to facilitate research on multi-modal product pre-training.
| Modality | Appearance | Usage | Specification | Selling Point | Production | Material | Category Descriptions |
Data Collections. We have been authorized by Alibaba Group to access and obtain this data. The detailed license is given in Sec. B of the supplementary material. The data is crawled from the Taobao website (https://tb.alicdn.com/snapshot/index.html). We analyze the front page of each E-commerce product and crawl the download information of product images, captions, videos and specifications. Then, we use this information to download data of four modalities, namely image, caption, video and table (product specifications), and remove duplicated data. After downloading, the audio track is extracted from each video with the moviepy tool (https://pypi.org/project/moviepy/) and saved in mp3 format. For product specifications, we extract 5,679 product properties and 24,398,673 values to construct a table database coarsely labeled by E-commerce merchants. In total, 6,313,067 samples are obtained. In the data collection process, about 1% of the data are not paired because of invalid download links. Hence, strictly speaking, M5Product is not a completely paired dataset, unlike traditional multi-modal datasets. After data collection, we summarize the characteristics of the different modalities in M5Product in Table 1.
| LJ Speech ljspeech17 | 13,100 | - | - | 2 | audio/text | no |
| Conceptual Caption CC | 3,300,000 | - | - | 2 | image/text | no |
| Visual Genome VG | 108,000 | - | - | 2 | image/text | no |
| RPC checkout RPC | 30,000 | 200 | 367,935 | 2 | image/text | no |
| Open Image OpenImage | 1,670,000 | - | - | 2 | image/text | no |
| Dress Retrieval dress | 20,200 | 50 | 20,200 | 2 | image/text | yes |
Data Format. The image data cover 6,313,067 products uploaded by 1,000,517 merchants. Each product has at least five product images, where the first image is the main image that gives an overview of the product and the rest depict its functionalities or characteristics. We pick all the main images to construct the dataset. Images of different sizes are divided into three groups representing different qualities. Caption data are provided by the 1,000,517 merchants. The text descriptions often do not match the other modalities well because of merchant fraud. According to the fraud level, the caption data can also be split into three types: well-matched, partially-matched and poorly-matched. Video data showcase the usage and characteristics of products to customers. In our dataset, these videos are recorded at 24 frames per second (FPS). We further sample the original frames and select one frame per second, since adjacent frames are similar and redundant and would cause an excessive computational burden. Audio data are extracted from the video data. We extract the audio corresponding to all sampled video frames. The audio frames are then transformed into spectrograms using Mel-Frequency Cepstral Coefficients (MFCC) 2005Combining . We set the frame size and hop size to 1,024 and 256 respectively. Tabular data are a special kind of database recording additional product characteristics such as appearance, purpose and producer. The tabular data are indexed by product ID and collected from the whole product database. There are 5,679 properties and more than 24,398,673 unique values.
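The sampling and windowing parameters above can be illustrated with a minimal sketch. It assumes that "one frame per second" means keeping the first frame of each second and that audio windows are taken without padding. The function names and the 16 kHz sample rate in the usage lines are hypothetical, and the MFCC computation itself is not shown.

```python
def sample_frame_indices(num_frames, fps=24):
    """Indices of one frame per second (assumed: the first frame of each second)."""
    return list(range(0, num_frames, fps))


def num_audio_frames(num_samples, frame_size=1024, hop_size=256):
    """Number of full analysis windows that fit in the signal without padding."""
    if num_samples < frame_size:
        return 0
    return 1 + (num_samples - frame_size) // hop_size


def frame_audio(signal, frame_size=1024, hop_size=256):
    """Split a 1-D sequence into overlapping windows of frame_size samples."""
    count = num_audio_frames(len(signal), frame_size, hop_size)
    return [signal[i * hop_size: i * hop_size + frame_size] for i in range(count)]


if __name__ == "__main__":
    # Hypothetical 10-second clip recorded at 24 FPS.
    kept = sample_frame_indices(240)
    print(len(kept), kept[:3])
    # Hypothetical 1-second mono signal; the 16 kHz sample rate is an assumption.
    windows = frame_audio([0.0] * 16000)
    print(len(windows), len(windows[0]))
```

With a hop of 256 and a window of 1,024 samples, consecutive windows overlap by 768 samples, so the number of audio frames grows roughly as the signal length divided by 256.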
Dataset split. The M5Product dataset is split into train, gallery-c, query-c, gallery-fg and query-fg sets. The train set contains 4,423,160 samples with 3,593 classes. The gallery-c and query-c sets are used for the coarse retrieval task, while the gallery-fg and query-fg sets are used for the fine-grained retrieval task. The difference lies in the category level of the annotation labels. In the query-fg and gallery-fg sets, all samples of a category belong to the same product, such as the IPHONE 11 Black, while in the coarse-grained sets each category contains several different products, such as dresses of different styles. When constructing the fine-grained sets, we extracted all cosmetics categories for fine-grained annotation and finally obtained 1,991 query-fg samples and 117,858 gallery-fg samples. The query-c and gallery-c sets contain 24,410 and 1,197,905 samples respectively. Of the gallery-c samples, 249,614 are matched with samples in query-c and 948,291 are not. These unmatched samples are added to the gallery-c set to increase the difficulty of the retrieval task. For the classification task, we selected 1,797 categories with a total of 40,000 samples for fine-tuning and 4,040 samples for testing.
Diversity analysis. We give a quantitative analysis of M5Product from the perspective of both modalities and categories. About 5% of products are not paired with all other modalities, e.g. some samples contain only images, captions and tabular properties. The category distribution is shown in Figure 3, and the diversity of modalities and categories is shown in Figure 4. From these figures, we can see that more than 6,000 classes are included in M5Product, covering a wide variety and a massive amount of E-commerce products such as clothes, cosmetics and instruments.
Quality Analysis. We give a fair comparison between M5Product and other datasets in Table 2. Compared with the existing multi-modal datasets, M5Product is the first extremely large public real-world E-commerce product dataset that contains data of the most modalities. In M5Product, five kinds of data (images, captions, videos, audio and tables) are mixed across different senses and data structures. Moreover, our dataset also has a large number of instances, i.e., more than six million samples and 6,232 coarse categories. These abundant data benefit several downstream tasks such as self-learning, weakly-supervised learning, multi-modal retrieval, cross-modal generation and fine-grained recognition. In particular, more modalities and instances can help researchers expand their fields from reading to listening and from objection to intuition.
Evaluation Metrics. Evaluation metrics are essential for fair comparisons of different methods on the downstream tasks. In our paper, we focus on three kinds of tasks (product retrieval, classification and clustering) at the feature level. For product retrieval, we adopt the widely used metrics mean Average Precision (mAP) and Precision retrieval_method1 ; retrieval_method2 ; retrieval_method3 to evaluate the retrieval accuracy on two sub-tasks, coarse and fine-grained retrieval. For product classification and clustering, all methods are evaluated by classification precision (classification accuracy), Clustering Accuracy (ACC), Normalized Mutual Information (NMI) NMI and Purity.
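For concreteness, the retrieval metrics and Purity can be sketched as follows. This is a minimal illustration under assumptions, not the evaluation code of the benchmark: each ranking is a list of 0/1 relevance flags in rank order, and average precision is normalized by the number of relevant items found in the ranked list. All function names are hypothetical.

```python
from collections import Counter


def precision_at_k(relevant, k):
    """Fraction of the top-k ranked items that are relevant (relevant holds 0/1 flags)."""
    return sum(relevant[:k]) / k if k > 0 else 0.0


def average_precision(relevant):
    """Mean of precision@i over the ranks i at which a relevant item appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def purity(cluster_ids, labels):
    """Sum of the majority-label counts of all clusters, divided by the sample count."""
    by_cluster = {}
    for c, y in zip(cluster_ids, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)
```

For coarse retrieval a gallery item would count as relevant when it shares the query's category, and for fine-grained retrieval when it is the same product instance.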
4 Benchmark for E-commerce multi-modal pre-training
4.1 M5-MMT and Detailed Implementation.
M5-MMT model. No existing model is directly applicable to handling all five modalities simultaneously. To overcome this problem, we propose a new multi-modal pre-training model, named M5-MMT, that is capable of processing five modality inputs in real-world scenarios, and plot its detailed architecture in Figure 6. As the figure shows, M5-MMT contains two modules (Separate Modality Encoder and Multi-modal Fusion Encoder) to achieve semantic alignment and joint learning of multi-modal inputs. Specifically, the Separate Modality Encoder consists of five transformer encoders for the different modalities: text, image, video, table and audio. The text encoder and table encoder are standard transformers that encode the product captions and table data separately. The image encoder takes proposals extracted by bottom-up attention bottom_up as inputs. The video encoder processes ordered frames sampled from the input video. For the audio encoder, M5-MMT encodes MFCC 2005Combining features extracted from the audio. After the Separate Modality Encoder, a contrastive loss is applied for semantic alignment. The second module, the Multi-modal Fusion Encoder, is built from cross-attention transformers, including text-image, text-table, text-video and text-audio cross-attention modules. Each cross-attention transformer learns the inter-modal relations between one modality and the other modalities by exchanging key-value pairs in the multi-headed attention mechanism. To improve the learning capability of M5-MMT, we also apply several mask-based pretext tasks to the Multi-modal Fusion Encoder. For modality-wise feature learning, we adopt masked multi-modal modeling tasks, including the Mask Language Modeling task (MLM), Mask Region Prediction task (MRP), Mask Entity Modeling task (MEM), Mask Frame Prediction task (MFP) and Mask Audio Modeling task (MAM).
For all masking tasks, 15% of the inputs are masked out and the remaining inputs are used to reconstruct the masked information. Note that for the MEM task, 15% of the entities, such as properties or values, are masked entirely, which encourages our model to better encode the table information in order to recover the masked inputs. The main differences between M5-MMT and existing multi-modal transformer models can be summarized as follows: 1) M5-MMT supports training with either a subset of the modalities or all five modalities; 2) M5-MMT is, to date, the first model that applies multi-modal contrastive learning to five modalities.
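The difference between element-level masking and the entity-level masking of MEM can be sketched as follows. This is a minimal illustration under assumptions: a fixed count of about 15% of the positions is chosen uniformly at random, at least one position is masked when the input is non-empty, and the `[MASK]` placeholder and function names are hypothetical. The actual sampling scheme of M5-MMT may differ.

```python
import random


def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Mask about mask_ratio of the positions; return (masked sequence, targets)."""
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]  # original element to be reconstructed
        masked[p] = mask_token
    return masked, targets


def mask_entities(entities, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Entity-level masking: every token of a chosen property or value is masked."""
    rng = random.Random(seed)
    n_mask = min(len(entities), max(1, round(len(entities) * mask_ratio)))
    chosen = set(rng.sample(range(len(entities)), n_mask))
    masked, targets = [], {}
    for i, entity in enumerate(entities):
        if i in chosen:
            targets[i] = list(entity)
            masked.append([mask_token] * len(entity))
        else:
            masked.append(list(entity))
    return masked, targets
```

In `mask_entities`, each entity is a list of tokens, e.g. `["brand"]` or `["pure", "cotton"]`, so a multi-token value is never only partially hidden.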
Training and Downstream Tasks. To verify the effectiveness of modality diversity, we train the M5-MMT model separately with different numbers of modalities and observe the variation in performance. Meanwhile, we compare the image-text version of M5-MMT with the image-text pre-training baselines ViLBERT ViLBERT and CLIP CLIP . We also provide two variants of modality completion to address training with missing and incomplete modalities. All models are evaluated with the same metrics mentioned in Sec. 3 on four downstream tasks: multi-modal retrieval, fine-grained retrieval, multi-modal classification and clustering. On M5Product, the four downstream tasks are defined as follows. The multi-modal retrieval task aims to search for the most relevant target products using combinations of two or more modalities. Similarly, the fine-grained retrieval task uses single or multiple modalities to match the most closely associated products at the instance level. The multi-modal classification and clustering tasks use the multi-modal features extracted by the pre-trained model to perform product classification and clustering.
Implementation Details. We implement five pre-training methods for the image and text modalities: Image-based, Text-based, ViLBERT ViLBERT , CLIP CLIP and our proposed M5-MMT. We use BERT Bert to initialize the linguistic transformer of M5-MMT. The number of transformer layers in the Separate Modality Encoder and in the Multi-modal Fusion Encoder is set to 6 each, which adds up to 12 transformer layers. We set the hidden state size of each modality transformer and of the other baselines to 768 for a fair comparison. The maximum sequence lengths of the caption and table are set to 36 and 64 respectively. We train M5-MMT with a total batch size of 64 for 5 epochs and use the Adam optimizer Adam with a warm-up learning rate of 1e-4. To handle multi-modal pre-training with missing and incomplete modalities, we also provide two simple baseline methods, and train and test both using five modalities on the M5Product dataset. Specifically, the first method, complete, directly deletes the samples with incomplete modalities during training, and the second method, complete, uses the existing modalities of each sample to crawl its missing modalities through web search. More details have been released at https://xiaodongsuper.github.io/M5Product_dataset.
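The first completion baseline, which drops samples with incomplete modalities, amounts to a simple filter. The sketch below assumes each sample is a dictionary keyed by modality name, with a missing or empty entry marking an absent modality. The key names and helper functions are hypothetical and do not reflect the released data format.

```python
MODALITIES = ("image", "caption", "video", "audio", "table")


def has_all_modalities(sample, required=MODALITIES):
    """True when every required modality is present and non-empty."""
    return all(sample.get(m) for m in required)


def drop_incomplete(samples, required=MODALITIES):
    """Keep only the samples that provide all required modalities."""
    return [s for s in samples if has_all_modalities(s, required)]


if __name__ == "__main__":
    # Hypothetical samples; the file names are placeholders.
    samples = [
        {"id": 1, "image": "a.jpg", "caption": "red dress", "video": "a.mp4",
         "audio": "a.mp3", "table": {"brand": "x"}},
        {"id": 2, "image": "b.jpg", "caption": "lipstick", "table": {"brand": "y"}},
    ]
    print([s["id"] for s in drop_incomplete(samples)])
```

The second baseline would instead fill the missing entries of a sample such as the second one before training, which keeps more data at the cost of an extra crawling step.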
We first train the model on the training split and then apply the pre-trained model to extract the modality features of the gallery and test splits for product retrieval, classification and clustering tasks.
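Feature-level retrieval of this kind can be sketched as a cosine-similarity ranking of gallery features against a query feature. The sketch assumes that per-modality features are L2-normalized and concatenated before comparison; this fusion rule is an illustrative assumption rather than the stated procedure of the benchmark, and all names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(features_by_modality, modalities):
    """Concatenate L2-normalized per-modality features in a fixed order (assumed fusion)."""
    fused = []
    for m in modalities:
        fused.extend(l2_normalize(features_by_modality[m]))
    return fused


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_gallery(query_vec, gallery):
    """gallery is a list of (product_id, vector); return ids by decreasing similarity."""
    scored = [(cosine(query_vec, vec), pid) for pid, vec in gallery]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [pid for _, pid in scored]
```

The ranked ids can then be compared with the category-level or instance-level ground truth to obtain mAP and Precision.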
4.2 Modality Diversity
We examine the performance of our proposed M5-MMT with different numbers of modalities to verify the effectiveness of modality diversity in multi-modal pre-training. Specifically, we train M5-MMT on the training split and report the mAP and Prec under different numbers of matching results for the multi-modal retrieval task. We also provide experimental results at the coarse- and fine-grained levels on M5Product. For coarse- and fine-grained retrieval, we focus on the category-level and instance-level ground truth respectively. The performance on the test split is reported in Tables 3, 4 and 5.
Coarse multi-modal retrieval. Multi-modal retrieval has been explored for E-commerce product recommendation and product search. Our experiments in Tables 3 and 4 demonstrate that retrieval performance improves significantly as modality diversity increases. From Table 3, we can observe that the modality complementary mechanism helps to combine the feature representations of the five modalities and improves the retrieval performance by up to 3 compared with the simple baseline (denoted as Text). This is mainly because different modalities capture different views of the semantic information of the same products.
Fine-grained multi-modal retrieval. We also report results on the subset with fine-grained annotations. The experiments in Table 5 likewise show that multi-modal diversity is beneficial for multi-modal retrieval performance. Meanwhile, we find that M5-MMT performs worse on the coarse-grained set than on the fine-grained subset. This is because the whole M5Product has more complex backgrounds and more categories, which makes it harder to match a query with its corresponding target.
The modality correlation is defined as the average cosine similarity between image and text features. We calculate the modality correlation of multi-modal models trained with different numbers of modalities and plot Figure 6 to show how the modality correlation varies with the number of modalities. We observe that as the number of modalities grows, the semantic alignment capability of the pre-training model becomes stronger.
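The modality correlation can be computed directly from paired features. The sketch below assumes one image feature vector and one text feature vector per product, of equal dimension, and treats a zero vector as having similarity 0. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def modality_correlation(image_feats, text_feats):
    """Average cosine similarity over paired image and text feature vectors."""
    if len(image_feats) != len(text_feats):
        raise ValueError("image and text features must be paired")
    sims = [cosine_similarity(i, t) for i, t in zip(image_feats, text_feats)]
    return sum(sims) / len(sims)
```

A value closer to 1 indicates that the image and text features of the same product point in more similar directions, which is the sense in which alignment is said to improve.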
4.3 Multi-modal Downstream Tasks
We find that multi-modal data can significantly improve the accuracy of product retrieval owing to modality complementarity. We also observe that text-based retrieval is more advantageous than image-based retrieval, probably because text data provide more detailed and direct information.
Multi-modal product classification and clustering. Experiments in Table 9 show that our proposed M5-MMT improves the classification and clustering results using large-scale unlabeled data. For the classification and clustering tasks, we use a linear classifier trained with a learning rate of 1e-4 and a typical K-means algorithm with a total label number of 1,792, respectively. For Image/Text-based methods, the image/text features are fed into the classification model, while for image-text methods, the image and text features are concatenated and fed into the classification model.
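Feature concatenation followed by K-means can be sketched with plain Lloyd iterations. This is a minimal illustration under assumptions: the initial centroids are supplied by the caller, so the number of clusters equals the number of initial centroids, and empty clusters keep their previous centroid. The function names are hypothetical and the linear classifier is not shown.

```python
def concat_features(image_feat, text_feat):
    """Concatenate image and text features into one vector."""
    return list(image_feat) + list(text_feat)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_centroids, num_iters=10):
    """Plain Lloyd iterations; returns (assignments, centroids)."""
    centroids = [list(c) for c in init_centroids]
    assignments = [0] * len(points)
    for _ in range(num_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: mean of each cluster; empty clusters keep their centroid.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids
```

The resulting assignments can be scored against the category labels with ACC, NMI and Purity.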
Our M5-MMT is superior in semantic learning compared with other baseline methods. We also reach the same conclusion that the performance of single-modal models is lower than that of multi-modal ones. Besides, there is still much room for performance improvement on the classification and clustering tasks, which verifies the difficulty level of our M5Product dataset.
Modality completion. Results in Table 8 show that the adopted modality completion methods handle the modality-missing problem found in real-world data well. Our M5-MMT using the mechanisms complete and complete achieves performance with large-scale multi-modal training that is comparable to the results using five modalities recorded in Table 3. We also observe that the mechanism complete obtains some improvement over complete. This is mainly because the modalities crawled through web search provide additional correct information to fill in the missing modalities.
In this paper, we present the M5Product dataset, which is the largest E-commerce product multi-modal dataset for multi-modal pre-training. To facilitate multi-modal research in retail and to increase seller and buyer engagement and conversions in E-commerce, we also provide a benchmark with different network architectures for multi-modal pre-training and several comparisons on multi-modal retrieval, classification and clustering tasks. In the future, we will expand our dataset for fine-grained multi-modal pre-training to support more modalities (e.g. product description paragraphs and user comments) and more downstream tasks, e.g. image and caption generation.
-  Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, and Longbo Huang. What makes multimodal learning better than single (provably). arXiv preprint arXiv:2106.04538, page 1, 2021.
-  Jack Hessel and Lillian Lee. Does my multimodal model learn cross-modal interactions? it’s harder to tell than you might think! In EMNLP, pages 861–877, 2020.
-  Tao Zhou, Mingxia Liu, Huazhu Fu, Jun Wang, Jianbing Shen, Ling Shao, and Dinggang Shen. Deep multi-modal latent representation learning for automated dementia diagnosis. In MICCAI, pages 629–638, 2019.
-  Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pages 2556–2565, 2018.
-  Xiu-Shen Wei, Quan Cui, Lei Yang, Peng Wang, and Lingqiao Liu. Rpc: A large-scale retail product checkout dataset. arXiv preprint arXiv:1901.07249, page 1, 2019.
-  Charles Corbiere, Hedi Ben-Younes, Alexandre Ramé, and Charles Ollion. Leveraging weakly annotated data for fashion image retrieval and label prediction. In ICCV Workshops, pages 2268–2274, 2017.
-  Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi Zhang, Hang Xu, and Xiaodan Liang. Product1m: Towards weakly supervised instance-level product retrieval via cross-modal pretraining. In ICCV, page 1, 2021.
-  Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740–755, 2014.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
-  Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Labelme: A database and web-based tool for image annotation. Int. J. Comput. Vis., 77(1-3):157–173, 2008.
-  Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, pages 1096–1104, 2016.
-  Yuying Ge, Ruimao Zhang, Xiaogang Wang, Xiaoou Tang, and Ping Luo. Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In CVPR, pages 5337–5345, 2019.
-  Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
-  Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NIPS, pages 13–23, 2019.
-  Keith Ito and Linda Johnson. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
-  Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, and Hung-yi Lee. Spoken squad: A study of mitigating the impact of speech recognition errors on listening comprehension. In ISCA, pages 3459–3463, 2018.
-  Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. TVQA: localized, compositional video question answering. In EMNLP, pages 1369–1379, 2018.
-  Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Movieqa: Understanding stories in movies through question-answering. In CVPR, pages 4631–4640, 2016.
-  Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: toward spatio-temporal reasoning in visual question answering. In CVPR, pages 1359–1367, 2017.
-  Huda AlAmri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, and Devi Parikh. Audio visual scene-aware dialog. In CVPR, pages 7558–7567, 2019.
-  Luowei Zhou, Chenliang Xu, and Jason J. Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, pages 7590–7598, 2018.
-  Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, pages 4580–4590, 2019.
-  Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, pages 5288–5296, 2016.
-  Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, pages 2630–2640, 2019.
-  Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, pages 1143–1151, 2011.
-  Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis., 123(1):32–73, 2017.
-  Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguistics, 2:67–78, 2014.
-  Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In ACL, pages 6418–6428, 2019.
-  Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, pages 2425–2433, 2015.
-  Yuting Hu, Liang Zheng, Yi Yang, and Yongfeng Huang. Twitter100k: A real-world dataset for weakly supervised cross-media retrieval. IEEE Trans. Multim., 20(4):927–938, 2018.
-  Josip Krapac, Moray Allan, Jakob J. Verbeek, and Frédéric Jurie. Improving web image search results using query-relative classifiers. In CVPR, pages 1094–1101, 2010.
-  Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. NUS-WIDE: a real-world web image database from national university of singapore. In CIVR, page 1, 2009.
-  Open images dataset. https://storage.googleapis.com/openimages/web/index.html/, 2018.
-  Amir Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In ACL, pages 2236–2246, 2018.
-  Yuxin Peng, Xin Huang, and Yunzhen Zhao. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges. IEEE Trans. Circuits Syst. Video Technol., 28(9):2372–2385, 2018.
-  Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: pre-training of generic visual-linguistic representations. In ICLR, page 1, 2020.
-  Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966, page 1, 2020.
-  Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In ICCV, pages 7463–7472, 2019.
-  Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. HERO: hierarchical encoder for video+language omni-representation pre-training. In EMNLP, pages 2046–2065, 2020.
-  Hao Tan and Mohit Bansal. LXMERT: learning cross-modality encoder representations from transformers. In EMNLP, pages 5099–5110, 2019.
-  Dehong Gao, Linbo Jin, Ben Chen, Minghui Qiu, Peng Li, Yi Wei, Yi Hu, and Hao Wang. Fashionbert: Text and image matching with adaptive loss for cross-modal retrieval. In SIGIR, pages 2251–2260, 2020.
-  Eric Dodds, Jack Culpepper, Simao Herdade, Yang Zhang, and Kofi Boakye. Modality-agnostic attention fusion for visual search with text feedback. arXiv preprint arXiv:2007.00145, page 1, 2020.
-  Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Linbo Jin, Ben Chen, Haoming Zhou, Minghui Qiu, and Ling Shao. Kaleido-bert: Vision-language pre-training on fashion domain. In CVPR, pages 12647–12657, 2021.
-  Ksr Murty and B. Yegnanarayana. Combining evidence from residual phase and mfcc features for speaker recognition. IEEE Signal Processing Letters, 13(1):52–55, 2005.
-  Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2916–2929, 2013.
-  Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2008.
-  Xiaoqiang Lu, Xiangtao Zheng, and Xuelong Li. Latent semantic minimal hashing for image retrieval. IEEE Trans. Image Process., 26(1):355–368, 2017.
-  Chengfu Yang and Zhang Yi. Document clustering using locality preserving indexing and support vector machines. Soft Comput., 12(7):677–683, 2008.
-  Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, page 1, 2015.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS Workshop, page 1, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
-  Jacob Whitehill, Paul Ruvolo, Tingfan Wu, Jacob Bergsma, and Javier R. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, pages 2035–2043, 2009.
For all authors…
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [YES]
(b) Have you read the ethics review guidelines and ensured that your paper conforms to them? [YES]
(c) Did you discuss any potential negative societal impacts of your work? [YES] See Sec. B in supplementary material.
(d) Did you describe the limitations of your work? [YES] See Sec. B in supplementary material.
If you are including theoretical results…
(a) Did you state the full set of assumptions of all theoretical results? [NA]
(b) Did you include complete proofs of all theoretical results? [NA]
If you ran experiments…
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [YES] Our benchmark and code are available at https://xiaodongsuper.github.io/M5Product_dataset.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [YES] All training configurations are released at https://xiaodongsuper.github.io/M5Product_dataset.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [NO] The performance in our experiments is quite stable across multiple runs.
(d) Did you include the amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [YES] See Sec. D of the supplementary material.
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
(a) If your work uses existing assets, did you cite the creators? [YES] See Sec. 4.
(b) Did you mention the license of the assets? [YES] See Sec. B.
(c) Did you include any new assets either in the supplemental material or as a URL? [YES]
(d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [YES] The benchmark data come from an E-commerce website, and we are authorized by the company that owns the website. The authorization license can be found at https://xiaodongsuper.github.io/M5Product_dataset/license.html.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [NO] Our dataset does not contain such information or content.
If you used crowdsourcing or conducted research with human subjects…
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [YES] For human annotations, we give the instructions in Sec. C in the supplementary material.
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [NA]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [YES] See Sec. C in the supplementary material.
B The M5Product dataset
The M5Product dataset, benchmark, data format and instructions are published on our website https://xiaodongsuper.github.io/M5Product_dataset. In case of a conflict of interest, our group reserves all rights of final interpretation.
License. Our M5Product dataset is released under the CC BY-NC-SA 4.0 license. Anyone may use it for non-commercial purposes. More details can be found at https://xiaodongsuper.github.io/M5Product_dataset/license.html.
Dataset documentation. https://xiaodongsuper.github.io/M5Product_dataset/documentation.html describes the dataset in detail and explains how to use it.
Benchmark. The benchmark results are shown at https://xiaodongsuper.github.io/M5Product_dataset/benchmark.html. Our code is under the MIT license and will be released upon publication.
Dataset Maintenance. The download links for our dataset are provided at https://xiaodongsuper.github.io/M5Product_dataset/download.html. Because the dataset requires about 21 TB of storage, we only provide the download link files on Google Drive and BaiduYunPan for all users.
Privacy Policies and Terms. Privacy Policies and Terms and Conditions are given in https://xiaodongsuper.github.io/M5Product_dataset/termofuse.html.
Dataset Limitations. As discussed in Sec. 3, M5Product mainly focuses on the typical modalities (image, text, table, video and audio). To extend the dataset with more diverse modalities, spatio-temporal data such as instance localization and product date will be provided.
C Collection of human baselines
For the product retrieval task, we resort to crowdsourcing to obtain the human annotations. Specifically, we present an image and text matching task to several human annotators, who are asked to select the best matching option. In the crowdsourcing system, each matching task is presented to five human workers. For the classification task, these human workers are required to select the best option from the candidate categories. A typical example of our human annotation interface is shown in Figure 7. For each annotated task, an annotator is paid 3 cents RMB.
D Implementation Details
Our models are implemented in PyTorch. To speed up training, we also use NVIDIA Apex (https://github.com/NVIDIA/apex) for mixed-precision training. All models are trained on 4 NVIDIA 3090 and 2080 Ti GPUs on our workstations. We use Adam to optimize the model parameters, with an initial learning rate of 1e-4 and a linear learning rate decay schedule with temperature parameter 0.1.
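As a rough illustration of the learning rate schedule described above, the following minimal sketch computes a linearly decaying learning rate starting from 1e-4. The total number of steps, the final learning rate of zero, and the function name `linear_decay_lr` are assumptions made for illustration; the text does not specify how the temperature parameter 0.1 enters the schedule, so it is left out here.

```python
def linear_decay_lr(step, total_steps, base_lr=1e-4, final_lr=0.0):
    """Learning rate after `step` updates under a linear decay schedule.

    `total_steps` and `final_lr` are illustrative assumptions; only the
    initial learning rate of 1e-4 comes from the text.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so that values outside [0, total_steps] stay in range.
    progress = min(max(step, 0), total_steps) / total_steps
    # Interpolate linearly from base_lr (progress 0) to final_lr (progress 1).
    return base_lr + (final_lr - base_lr) * progress


# Learning rates at five evenly spaced points of a toy 4-step run.
schedule = [linear_decay_lr(s, total_steps=4) for s in range(5)]
```

In a PyTorch training loop, the ratio of this value to the initial learning rate could serve as the multiplicative factor passed to `torch.optim.lr_scheduler.LambdaLR`.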
The annotation of a query consists of all the matched instances in the gallery split. Because checking every gallery instance for each query would be too costly, we first apply ResNet and BERT to the remaining subsets (all except the train split) to extract embeddings and construct the query candidate pool. Specifically, we oversample the categories that have more than 2,000 instances, then calculate the image and caption fusion similarities between the sampled instances and the remaining ones to create a pre-ranking list, which serves as the candidate pool and reduces the labeling cost. The final size of the candidate shortlist for each query is 500, which is about of the whole gallery split. During the crowd-sourced annotation process, human workers review both images and captions in the given query list to select which samples match the query instance.
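The pre-ranking step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that the image and caption embeddings are already extracted, that the fusion similarity is a weighted average of the two cosine similarities, and that instances are plain dictionaries. The names `fusion_similarity`, `build_shortlist` and `image_weight` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def fusion_similarity(query, candidate, image_weight=0.5):
    """Assumed fusion rule: weighted average of image and caption similarity."""
    img = cosine(query["image"], candidate["image"])
    txt = cosine(query["caption"], candidate["caption"])
    return image_weight * img + (1.0 - image_weight) * txt


def build_shortlist(query, gallery, k=500):
    """Return the ids of the k gallery items most similar to the query."""
    scored = [(fusion_similarity(query, item), item["id"]) for item in gallery]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Toy example with 2-dimensional embeddings and a shortlist of size 2.
query = {"id": "q", "image": [1.0, 0.0], "caption": [0.0, 1.0]}
gallery = [
    {"id": "a", "image": [1.0, 0.0], "caption": [0.0, 1.0]},
    {"id": "b", "image": [0.0, 1.0], "caption": [1.0, 0.0]},
    {"id": "c", "image": [1.0, 0.0], "caption": [1.0, 0.0]},
]
shortlist = build_shortlist(query, gallery, k=2)
```

Only the shortlisted items would then be shown to human workers, which keeps the number of pairs to label per query fixed regardless of the gallery size.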
Annotation Rules. It is quite challenging to define whether two images contain the same product when critical aspects are not given in their captions and images. In our annotations, we use product images and their captions as the primary materials for gallery construction. Hence, we define several rules to determine the "same product" condition:
-  The two images may be taken under different conditions (e.g., backgrounds, angles, etc.), but the products in both images are the same.
-  They should have the same color/model/shape/style, or other features that can be distinguished by humans.
-  Two captions with the same product name may describe the same product object differently.
-  They have various characteristics and cannot be identified by any single feature alone.
To ensure labeling consistency, each annotation pair is labeled by five human workers in the crowd-sourced platform. In the process, we first make a small dataset from our query list as the Gold Problem to evaluate the annotation capability of each human worker. Based on the labeled results ("Matched" or "Not Matched") from human workers and their annotation capability, we utilize the weighted GLAD  inference algorithm to determine the final accepted labels.
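To make the idea of capability-weighted label aggregation concrete, the sketch below uses a simplified stand-in rather than GLAD itself: each worker's accuracy is estimated on the gold problems, and votes are combined with log-odds weights. GLAD additionally infers per-item difficulty with an EM procedure, which is omitted here. The function names and the smoothing constant are assumptions for illustration.

```python
import math


def worker_accuracy(gold_answers, worker_answers, smoothing=1.0):
    """Smoothed accuracy of one worker on the gold problems.

    Smoothing keeps the estimate strictly between 0 and 1 so that the
    log-odds weight below is always finite.
    """
    correct = sum(
        1 for q, truth in gold_answers.items() if worker_answers.get(q) == truth
    )
    return (correct + smoothing) / (len(gold_answers) + 2.0 * smoothing)


def weighted_vote(labels, accuracies):
    """Aggregate binary labels (True = "Matched") with log-odds weights."""
    score = 0.0
    for worker, matched in labels.items():
        acc = accuracies[worker]
        weight = math.log(acc / (1.0 - acc))  # more reliable workers count more
        score += weight if matched else -weight
    return score > 0.0


# Five workers label one pair; two reliable workers outweigh three weak ones.
accuracies = {"w1": 0.9, "w2": 0.9, "w3": 0.55, "w4": 0.55, "w5": 0.55}
labels = {"w1": True, "w2": True, "w3": False, "w4": False, "w5": False}
final_label = weighted_vote(labels, accuracies)
```

In this toy case a plain majority vote would return "Not Matched", while the weighted vote follows the two workers who performed well on the gold problems.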