GIT: A Generative Image-to-text Transformer for Vision and Language

by Jianfeng Wang et al.

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoders/decoders) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost performance. Without bells and whistles, our GIT establishes new state-of-the-art results on 12 challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.






Task | Benchmark | Prior SOTA | GIT (ours) | Gain
Image captioning | COCO | 138.7 [abs-2101-00529] | 148.8 | +10.1
Image captioning | nocaps | 120.6 [yu2022coca] | 123.4 | +3.7
Image captioning | VizWiz-Captions | 94.1 [GongZWCZSL21] | 114.4 | +20.3
Image captioning | TextCaps | 109.7 [abs-2012-04638] | 138.2 | +28.5
Image QA | ST-VQA | 59.7 [abs-2012-04638] | 69.6 | +9.9
Image QA | VizWiz-VQA | 65.4 [alayrac2022flamingo] | 67.5 | +2.1
Image QA | OCR-VQA | 64.1 [HanHH20] | 68.1 | +4.0
Video captioning | MSVD | 120.6 [abs-2111-13196] | 180.2 | +59.6
Video captioning | MSRVTT | 60.0 [abs-2201-08264] | 73.9 | +13.9
Video captioning | VATEX | 86.5 [abs-2110-05204] | 93.8 | +7.3
Video QA | MSVD-QA | 48.3 [wang2022all] | 56.8 | +8.5
Video QA | TGIF-Frame | 69.5 [ZellersLHYPCFC21] | 72.8 | +3.3
Table 1: New state-of-the-art performance with our GIT across 12 image/video captioning and question answering (QA) tasks. *: evaluated on the public server. CIDEr scores are reported for captioning tasks.

Introduction

Tremendous advances have been made in recent years in vision-language (VL) pre-training, especially based on large-scale image-text pairs, e.g., CLIP [clip], Florence [abs-2111-11432], and SimVLM [wang2021simvlm]. The learned representation greatly boosts the performance on various downstream tasks, such as image captioning [LinMBHPRDZ14], visual question answering (VQA) [GoyalKSBP16], and image-text retrieval. During pre-training, Masked Language Modeling (MLM) and Image-Text Matching (ITM) tasks have been widely used [abs-2012-06946, FangW0WY021, Li0LZHZWH0WCG20, abs-2101-00529, ChenLYK0G0020, abs-2111-02387, ufo, KimSK21]. However, these losses differ from the downstream tasks, so task-specific adaptations have to be made. For example, ITM is removed for image captioning [ufo, Li0LZHZWH0WCG20], and an extra randomly initialized multi-layer perceptron is added for VQA [wang2021simvlm, Li0LZHZWH0WCG20]. To reduce this discrepancy, recent approaches [cho2021unifying, wang2021simvlm, abs-2111-12085, abs-2202-03052] have attempted to design unified generative models for pre-training, as most VL tasks can be cast as generation problems. These approaches typically leverage a multi-modal encoder and a text decoder with careful design of the text input and the text target. To further push the frontier of this direction, we present a simple Generative Image-to-text Transformer, named GIT, which consists only of one image encoder and one text decoder. The pre-training task is simply to map the input image to the entire associated text description with the language modeling objective. Despite its simplicity, GIT sets new state of the art across 12 challenging benchmarks with a large margin, as summarized in Table 1.


Figure 1: Example captions generated by GIT. The model demonstrates a strong capability of recognizing scene text, tables/charts, food, banknotes, logos, landmarks, characters, products, etc.

The image encoder is a Swin-like vision transformer [DosovitskiyB0WZ21, abs-2111-11432] pre-trained on massive image-text pairs with the contrastive task [JiaYXCPPLSLD21, clip, abs-2111-11432]. This eliminates the dependency on an object detector, which is used in many existing approaches [00010BT0GZ18, Li0LZHZWH0WCG20, abs-2012-06946, abs-2101-00529, ChenLYK0G0020, FangW0WY021]. To extend it to the video domain, we simply extract the features of multiple sampled frames and concatenate them as the video representation. The text decoder is a transformer network that predicts the associated text. The entire network is trained with the language modeling task. For VQA, the input question is treated as a text prefix, and the answer is generated in an auto-regressive way. Furthermore, we present a new generation-based scheme for ImageNet classification, where the predicted labels come directly from our generative model without pre-defining the vocabulary. The approach is simple, but the performance is surprisingly impressive after we scale up the pre-training data and the model size. Fig. 1 shows captions generated by GIT fine-tuned on TextCaps. The samples demonstrate the model's strong capability of recognizing and describing scene text, tables, charts, food, banknotes, logos, landmarks, characters, celebrities, products, etc., indicating that our model has encoded rich multi-modal knowledge about the visual world. Our main contributions are as follows.


  • We present GIT, which consists of only one image encoder and one text decoder, pre-trained on 0.8 billion image-text pairs with the language modeling task.

  • We demonstrate new state-of-the-art performance on 12 image/video captioning and QA tasks (Table 1), without dependency on object detectors, object tags, or OCR. On TextCaps, we surpass human performance for the first time.

  • We present a new scheme of generation-based image classification. On ImageNet-1K, we show decent performance (88.79% top-1 accuracy) with our GIT.

Related Work

In VL pre-training, multi-task pre-training has been widely used to empower the network with multiple or enhanced capabilities. For example, MLM and ITM are widely adopted pre-training tasks [Li0LZHZWH0WCG20, KimSK21, abs-2101-00529, abs-2012-06946, xue2021probing, LuBPL19, TanB19]. Recently, the image-text contrastive loss has also been added [yu2022coca, li2021align, ufo]. Since most VL tasks can be formulated as text generation [cho2021unifying], a single generation model can be pre-trained to support various downstream tasks. The input and output texts are usually carefully designed to pre-train such a generation model. For example, in [cho2021unifying], the text is properly masked as the network input, and the goal is to recover the masked text span. SimVLM [wang2021simvlm] randomly splits a text sentence into the input and the target output. In these methods, a multi-modal transformer encoder is utilized to incorporate the text inputs before decoding the output. For image representation, Faster R-CNN has been used in most existing approaches [00010BT0GZ18, Li0LZHZWH0WCG20, abs-2012-06946, abs-2101-00529, ChenLYK0G0020, FangW0WY021] to extract region features. Recently, there has been growing interest in dense representations [abs-2004-00849, wang2021simvlm, ufo, KimSK21, abs-2112-05230, abs-2111-02387, li2021align] from the feature map, which require no bounding box annotations and make it easy to train the entire network end-to-end. In addition to the representation from the feature map, object tags [Li0LZHZWH0WCG20, abs-2012-06946, abs-2101-00529, abs-2111-12727, abs-2112-05230] are leveraged to help the transformer understand the context, especially novel objects. For scene-text-related tasks, OCR is invoked to generate the scene text as additional network input, e.g., in [HuSDR20, abs-2012-04638]. For text prediction, a transformer network is typically used, which can incorporate cross-attention modules to fuse the image tokens, e.g., [cho2021unifying, alayrac2022flamingo, abs-2111-12085, yu2022coca], or only self-attention modules where the image tokens are concatenated with the text tokens, e.g., [Li0LZHZWH0WCG20, ChenLYK0G0020, abs-2101-00529, abs-2012-06946, abs-2112-05230].

Generative Image-to-text Transformer


Figure 2: Network architecture of our GIT, composed of one image encoder and one text decoder. (a): The training task in both pre-training and captioning is the language modeling task to predict the associated description. (b): In VQA, the question is placed as the text prefix. (c): For video, multiple frames are sampled and encoded independently. The features are added with an extra learnable temporal embedding (initialized as 0) before concatenation.

With large-scale image-text pairs, our goal is to pre-train a VL model that is simple yet effective for image/video captioning and QA tasks. As the input is an image and the output is text, the minimal set of components is one image encoder and one text decoder, which are the only components of our GIT, as illustrated in Fig. 2.

Network Architecture

The image encoder is based on the contrastive pre-trained model [abs-2111-11432]. The input is the raw image and the output is a compact 2D feature map, which is flattened into a list of features. With an extra linear layer and a layernorm layer, the image features are projected into the embedding dimension of the text decoder, which takes them as input. The text decoder is a transformer module that predicts the text description. It consists of multiple transformer blocks, each composed of one self-attention layer and one feed-forward layer. The text is tokenized and embedded into the same dimension, followed by an addition of the positional encoding and a layernorm layer. The image features are concatenated with the text embeddings as the input to the transformer module. The text begins with the [BOS] token and is decoded in an auto-regressive way until the [EOS] token or the maximum number of steps is reached. A seq2seq attention mask is applied to the transformer: if entry (i, j) of the mask is 1, the i-th output can depend on the j-th input; otherwise, it cannot. The mask is constructed such that each text token depends only on the preceding tokens and all image tokens, while image tokens can attend to each other. This is different from a unidirectional attention mask, where not every image token can rely on all other image tokens.
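The seq2seq attention mask described above can be sketched as follows. This is an illustrative NumPy construction with hypothetical names, not the authors' released code:

```python
import numpy as np

def seq2seq_attention_mask(n_img: int, n_txt: int) -> np.ndarray:
    """Build the seq2seq attention mask: entry (i, j) = 1 means output i may attend to input j.

    Image tokens (the first n_img positions) attend to all image tokens but to no text;
    text tokens attend to all image tokens and causally to text tokens.
    """
    n = n_img + n_txt
    mask = np.zeros((n, n), dtype=np.int64)
    # Image tokens attend to each other (full bidirectional attention).
    mask[:n_img, :n_img] = 1
    # Text tokens attend to all image tokens.
    mask[n_img:, :n_img] = 1
    # Text tokens attend causally to text tokens (including themselves).
    mask[n_img:, n_img:] = np.tril(np.ones((n_txt, n_txt), dtype=np.int64))
    return mask
```

With 2 image tokens and 3 text tokens, the mask lets text position 2 see both image tokens but not text positions 3 or 4, matching the seq2seq pattern rather than a fully unidirectional one.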


Pre-training

For each image-text pair, let I be the image and y_i, i ∈ {1, ..., N}, be the text tokens, with y_0 the [BOS] token and y_{N+1} the [EOS] token. We apply the language modeling (LM) loss to train the model:

l = \frac{1}{N+1} \sum_{i=1}^{N+1} \mathrm{CE}\big(y_i, p(y_i \mid I, \{y_j, j = 0, \cdots, i-1\})\big),

where CE is the cross-entropy loss with label smoothing of 0.1.
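A direct NumPy rendering of this loss may help make it concrete. This is a sketch: the exact smoothing convention (here, eps/V spread over the whole vocabulary) is an assumption, as implementations vary:

```python
import numpy as np

def lm_loss_with_label_smoothing(logits: np.ndarray, targets: np.ndarray, eps: float = 0.1) -> float:
    """Per-token cross-entropy with label smoothing, averaged over the N+1 predicted positions.

    logits:  (N+1, V) unnormalized scores for tokens y_1 .. y_N and [EOS]
    targets: (N+1,)   integer token ids of the ground-truth continuation
    """
    n, v = logits.shape
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Smoothed target distribution: eps spread uniformly, (1 - eps) extra on the gold token.
    smooth = np.full((n, v), eps / v)
    smooth[np.arange(n), targets] += 1.0 - eps
    # Cross-entropy against the smoothed distribution, averaged over positions.
    return float(-(smooth * logp).sum(axis=1).mean())
```

For uniform logits the loss equals log V regardless of eps, and it decreases as the model places more mass on the gold tokens.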


Fine-tuning

For the image captioning task, as the training data format is the same as in pre-training, we apply the same LM task to fine-tune our GIT. For visual question answering, the question and the ground-truth answer are concatenated as a new special caption during fine-tuning, but the LM loss is only applied on the answer and the [EOS] tokens. During inference, the question is interpreted as the caption prefix, and the completed part is the prediction. To extend the model to the video domain, we sample multiple frames from each video clip and encode each frame via the image encoder independently. Afterwards, we add a learnable temporal embedding (initialized as zeros) and concatenate the features from the sampled frames. The final representation is used in the same way as the image representation for captioning and question answering. We also apply our generation model to the image classification task, where the class names are interpreted as image captions, and our GIT is fine-tuned to predict the result in an auto-regressive way. This is different from existing work, which normally pre-defines the vocabulary and uses a linear layer (with softmax) to predict the likelihood of each category. This new generation-based scheme is beneficial when new data and new categories are added to an existing dataset: the network can continue training on the new data without introducing new parameters.
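The question-as-prefix fine-tuning can be illustrated with a small helper that builds the token sequence and the loss mask. The function and token ids are hypothetical placeholders for the scheme described above:

```python
def vqa_tokens_and_loss_mask(question_ids, answer_ids, bos_id=0, eos_id=1):
    """Concatenate question and answer into one caption-like sequence for LM fine-tuning.

    Returns (input_ids, target_ids, loss_mask): the decoder predicts each next token,
    but the loss is applied only where loss_mask is 1 (the answer and [EOS] tokens).
    """
    seq = [bos_id] + list(question_ids) + list(answer_ids) + [eos_id]
    input_ids = seq[:-1]   # tokens fed to the decoder
    target_ids = seq[1:]   # next-token targets, shifted by one
    # Zero loss on question tokens: the model only learns to generate the answer.
    loss_mask = [0] * len(question_ids) + [1] * (len(answer_ids) + 1)
    return input_ids, target_ids, loss_mask
```

At inference time the same layout is used with the answer part empty: the decoder is fed [BOS] plus the question tokens and completes the sequence auto-regressively.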

Relation to Existing Work

The concurrent work Flamingo [alayrac2022flamingo] leverages a much larger model, primarily for zero-shot and few-shot VL tasks. Both Flamingo [alayrac2022flamingo] and our GIT contain one image encoder and one text decoder with LM pre-training. In Flamingo, extra cross-attention modules are inserted into the decoder to incorporate the image representations, while we simply concatenate the image representation with the text representation as the input to the decoder. Another difference is that Flamingo freezes the pre-trained weights of the image encoder and text decoder and only tunes the extra cross-attention modules, which preserves the generalization capability of the large language model. In our GIT, all parameters are updated to better fit the VL tasks. Another concurrent work, CoCa [yu2022coca], unifies the contrastive task and the generation task in one pre-training phase. Our approach is equivalent to separating the two tasks sequentially: (i) using the contrastive task to pre-train the image encoder, followed by (ii) using the generation task to pre-train both the image encoder and the text decoder.



Experiments

We collect 0.8B image-text pairs for pre-training, which include COCO [LinMBHPRDZ14], Conceptual Captions (CC3M) [SoricutDSG18], SBU [OrdonezKB11], Visual Genome (VG) [KrishnaZGJHKCKL16], Conceptual Captions (CC12M) [changpinyo2021cc12m], ALT200M [abs-2111-12233], and an extra 0.6B data following a similar collection procedure as in [abs-2111-12233]. The image encoder is initialized from the pre-trained contrastive model [abs-2111-11432]. The hidden dimension is 768. The text decoder consists of 6 randomly-initialized transformer blocks. The total number of model parameters is 0.7 billion. The learning rates of the image encoder and the decoder are set separately and follow the cosine decay to 0. Pre-training on the 0.8B data runs for 2 epochs. During inference, beam search is used with the length penalty [WuSCLNMKCGMKSJL16] set to 0.6.

Results on Image Captioning

COCO. As a common practice, we use the Karpathy split [KarpathyL15] of COCO [LinMBHPRDZ14] for evaluation. The results with both cross-entropy optimization and SCST [scst] are presented in Table 2. Our model achieves new SOTA performance 1) on all metrics (BLEU@4 [PapineniRWZ02], METEOR [DenkowskiL14], CIDEr [VedantamZP15], SPICE [AndersonFJG16]) with SCST and 2) on CIDEr and BLEU@4 with cross-entropy optimization only. We also evaluate our best fine-tuned model on the online test set (c40). As Table 3 shows, our model surpasses the prior SOTA by a large margin (10 points on CIDEr).

Method | Cross-Entropy (B@4, M, C, S) | SCST (B@4, M, C, S)
MiniVLM [abs-2012-06946] | 35.6 28.6 119.8 21.6 | 39.2 29.7 131.7 23.5
DistillVLM [FangW0WY021] | 35.6 28.7 120.8 22.1 | - - - -
ViTCap [abs-2112-05230] | 36.3 29.3 125.2 22.6 | 41.2 30.1 138.1 24.1
OSCAR [Li0LZHZWH0WCG20] | 37.4 30.7 127.8 23.5 | 41.7 30.6 140.0 24.5
VinVL [abs-2101-00529] | 38.5 30.4 130.8 23.4 | 41.0 31.1 140.9 25.2
UFO [ufo] | 38.7 30.0 131.2 23.3 | - - - -
Flamingo [alayrac2022flamingo] | - - 138.1 - | - - - -
LEMON [abs-2111-12233] | 41.5 30.8 139.1 24.1 | 42.6 31.4 145.5 25.5
SimVLM [wang2021simvlm] | 40.6 33.7 143.3 25.4 | - - - -
CoCa [yu2022coca] | 40.9 33.9 143.6 24.7 | - - - -
OFA [abs-2202-03052] | - - - - | 43.5 31.9 149.6 26.1
UniversalCap [abs-2111-12727] | - - - - | 42.9 31.5 150.2 25.2
GIT (ours) | 44.1 31.5 144.8 24.7 | 44.1 32.2 151.1 26.3
Table 2: Results on COCO captioning with the Karpathy [KarpathyL15] split. B@4: BLEU@4; M: METEOR; C: CIDEr; S: SPICE. The highest number is highlighted in bold.
Method | B@4 | M | R | C
BUTD [00010BT0GZ18] | 68.5 | 36.7 | 72.4 | 120.5
VinVL [abs-2101-00529] | 74.9 | 40.8 | 76.8 | 138.7
GIT (ours) | 78.3 | 42.0 | 78.4 | 148.8
Table 3: Results on the COCO test set (c40).

Method | test-dev (C, S) | test-std (C, S)
MTMA [GongZWCZSL21] | 94.9 19.9 | 94.1 19.9
GIT (ours) | 113.1 22.2 | 114.4 22.3
Table 4: Results on VizWiz-Captions [abs-2002-08565].

Method | Validation (C, S) | Test (C, S)
OSCAR [Li0LZHZWH0WCG20] | 83.4 11.4 | 80.9 11.3
Human [abs-1812-08658] | 87.1 14.2 | 85.3 14.6
VIVO [abs-2009-13682] | 88.3 12.4 | 86.6 12.4
VinVL [abs-2101-00529] | 94.3 13.1 | 92.5 13.1
UFO [ufo] | 94.3 13.6 | 92.3 13.6
SimVLM [wang2021simvlm] | 115.2 - | 115.2 -
LEMON [abs-2111-12233] | 117.3 15.0 | 114.3 14.9
UniversalCap [abs-2111-12727] | 122.1 15.0 | 119.3 15.1
CoCa [yu2022coca] | 122.4 15.5 | 120.6 15.5
GIT (ours) | 125.5 16.0 | 123.4 15.9
Table 5: Results on nocaps. C: CIDEr; S: SPICE.

nocaps. The dataset [abs-1812-08658] is collected from Open Images [OpenImages] and contains a much wider (than COCO) spectrum of novel objects in the wild. We directly employ GIT fine-tuned on COCO and evaluate it against the validation and test sets. As shown in Table 5, our method improves over the prior SOTA [GongZWCZSL21] significantly on both validation and test sets. Compared with CoCa [yu2022coca], our model is much smaller in model size (0.7B vs 2.1B), but achieves higher performance (123.4 vs 120.6 on CIDEr). The results also reveal that our model is capable of identifying novel objects without object tags.

Method | Validation set (B, M, R, S, C) | Test set (B, M, R, S, C)
BUTD [00010BT0GZ18]* | 20.1 17.8 42.9 11.7 41.9 | 14.9 15.2 39.9 8.8 33.8
AoANet [huang2019attention]* | 20.4 18.9 42.9 13.2 42.7 | 15.9 16.6 40.4 10.5 34.6
M4C-Cap. [HuSDR20]* | 23.3 22.0 46.2 15.6 89.6 | 18.9 19.8 43.2 12.8 81.0
Anc.-Cap. [abs-2105-03236] | 24.7 22.5 47.1 15.9 95.5 | 20.7 20.7 44.6 13.4 87.4
TAP [abs-2012-04638] | 25.8 23.8 47.9 17.1 109.2 | 21.9 21.8 45.6 14.6 103.2
TAP# [abs-2012-04638] | 28.1 24.4 49.3 17.7 119.0 | 22.9 22.0 46.5 14.6 109.7
Human [abs-2003-12462] | - - - - - | 24.4 26.1 47.0 18.8 125.5
GIT (ours) | 37.0 27.6 54.1 21.1 143.7 | 33.1 26.2 52.2 19.6 138.2
Table 6: Results on TextCaps [abs-2003-12462]. The test set is evaluated by the server. *: the numbers are from [abs-2003-12462]. B: BLEU@4; M: METEOR; R: ROUGE-L; S: SPICE; C: CIDEr. #: winner entry of the CVPR 2021 workshop challenge.

TextCaps. The dataset [abs-2003-12462] requires the model to recognize scene text and relate it to its visual context in a natural language description. Existing work typically leverages an OCR system to infer the text and uses a dynamic pointer network or copy mechanism to decide whether the OCR text should be copied into the output. In contrast, we directly fine-tune GIT on the TextCaps training set and evaluate it without OCR input. As shown in Table 6, our solution outperforms the previous SOTA (TAP [abs-2012-04638]) by a substantial margin (28.5 points on CIDEr), and also surpasses human performance for the first time.

VizWiz-Captions. The images in VizWiz-Captions [abs-2002-08565] are taken by visually impaired people. Table 4 shows the fine-tuned results. Our approach significantly outperforms the prior SOTA (MTMA [GongZWCZSL21]) by a large margin (20.3 on CIDEr). Note that the prior SOTA uses extra modules: OCR, an object detector, and model ensembling.

Results on Visual Question Answering

The evaluation benchmarks include VQAv2 [GoyalKSBP16], TextVQA [singh2019towards], VizWiz-VQA [Gurari0SGLGLB18], ST-VQA [biten2019scene], and OCR-VQA [mishra2019ocr]. On VQAv2, most approaches pre-define the answer vocabulary and formulate the task as a classification problem. As described in the method section, we instead place the question as the caption prefix, and the completed part is the prediction. On the other tasks, most approaches resort to OCR systems to predict the scene text, while we use no OCR input. Before fine-tuning on each benchmark, we run an intermediate fine-tuning on the combination of the training data of VQAv2, TextVQA, ST-VQA, OCR-VQA, VizWiz-VQA, Visual Genome QA [KrishnaZGJHKCKL16], GQA [HudsonM19], and OK-VQA [marino2019ok]. To avoid data contamination, we remove images duplicated with the test and validation sets of the target benchmarks. As illustrated in Table 7, we significantly boost the prior SOTA on VizWiz-VQA, ST-VQA, and OCR-VQA. Compared with the concurrent work Flamingo [alayrac2022flamingo], we achieve higher accuracy (+5.4) on TextVQA and lower (-3.29) on VQAv2. Note that Flamingo's model size is 80B, 114 times that of ours (0.7B).

(a) VQAv2 [GoyalKSBP16]
Vocabulary | Model | test-dev | test-std
Closed | OSCAR [Li0LZHZWH0WCG20] | 73.61 | 73.82
Closed | UNITER [ChenLYK0G0020] | 73.82 | 74.02
Closed | VILLA [Gan0LZ0020] | 74.69 | 74.87
Closed | UNIMO [li2020unimo] | 75.06 | 75.27
Closed | ALBEF [li2021align] | 75.84 | 76.04
Closed | VinVL [abs-2101-00529] | 76.52 | 76.60
Closed | UFO [ufo] | 76.64 | 76.76
Closed | CLIP-ViL [abs-2107-06383] | 76.48 | 76.70
Closed | METER [abs-2111-02387] | 77.68 | 77.64
Closed | OFA [abs-2202-03052] | 79.87 | 80.02
Closed | SimVLM [wang2021simvlm] | 80.03 | 80.34
Closed | Florence [abs-2111-11432] | 80.16 | 80.36
Closed | CoCa [yu2022coca] | 82.3 | 82.3
Open | BLIP [abs-2201-12086] | 78.25 | 78.32
Open | Flamingo [alayrac2022flamingo] | 82.0 | 82.1
Open | GIT (ours) | 78.56 | 78.81

(b) TextVQA [singh2019towards]
Model | validation | test
M4C [HuSDR20] | 40.55 | 40.46
LaAP-Net [HanHH20] | 41.02 | 41.41
SA-M4C [KantBASPLA20] | 45.4 | 44.6
SMA [abs-2006-00753] | 44.58 | 45.51
TAP [abs-2012-04638] | 54.71 | 53.97
Flamingo [alayrac2022flamingo] | 57.1 | 54.1
Mia [abs-2106-15332] | - | 73.67
GIT (ours) | 59.93 | 59.75

(c) VizWiz-QA [Gurari0SGLGLB18]
Model | test-dev | test
Challenge winner† [vizwiz2021winner] | 61.8 | 60.6
Flamingo [alayrac2022flamingo] | 65.7 | 65.4
GIT (ours) | 68.0 | 67.5

(d) ST-VQA [biten2019scene]
Model | Val Acc. | Val ANLS | Test ANLS
M4C [HuSDR20] | 38.1 | 47.2 | 46.2
LaAP-Net [HanHH20] | 39.7 | 49.7 | 48.5
SA-M4C [KantBASPLA20] | 42.2 | 51.2 | 50.4
TAP [abs-2012-04638] | 50.8 | 59.8 | 59.7
GIT (ours) | 59.2 | 69.1 | 69.6

(e) OCR-VQA [mishra2019ocr]
Model | val | test
BLOCK+CNN+W2V [mishra2019ocr] | - | 48.3
M4C [HuSDR20] | 63.5 | 63.9
LaAP-Net [HanHH20] | 63.8 | 64.1
GIT (ours) | 67.8 | 68.1

Table 7: Results on visual question answering. (a): for VQAv2, approaches are divided according to whether the answer vocabulary is pre-defined (Closed) or not (Open). A model with a closed vocabulary can be a classification model or a generation model with constrained outputs, e.g., [abs-2202-03052]. (b): for TextVQA, Mia [abs-2106-15332] is the winner entry of the TextVQA Challenge 2021 with a fine-tuned T5-3B [RaffelSRLNMZLL20] model. (c): †: winner entry of the 2021 VizWiz Grand Challenge Workshop.

Results on Video Captioning and Question Answering

On the video captioning task, the performance is evaluated on MSVD [chen-dolan-2011-collecting] with the widely-used splits from [VenugopalanXDRMS14], MSRVTT [xu2016msr], YouCook2 [zhou2018towards], VATEX [wang2019vatex], and TVC [lei2020tvr]. On VATEX, the performance is evaluated on both the public test and the private test (evaluated on the server). The results are shown in Table 8, and we achieve new SOTA on MSVD, MSRVTT, and VATEX. For example, on the VATEX private test, our results are even better (93.8 vs 86.5) than CLIP4Caption++ [abs-2110-05204], which relies on model ensembling and additional subtitle input. This is also better than Flamingo [alayrac2022flamingo] (84.2) with 80B parameters.

(a) MSVD [chen-dolan-2011-collecting]
Method | B@4 | M | R | C
SibNet [liu2020sibnet] | 54.2 | 34.8 | 71.7 | 88.2
POS+CG [wang2019controllable] | 52.5 | 34.1 | 71.3 | 88.7
OA-BTG [zhang2019object] | 56.9 | 36.2 | - | 90.6
STG-KD [pan2020spatio] | 52.2 | 36.9 | 73.9 | 93.0
PMI-CAP [chen2020learning] | 54.6 | 36.4 | - | 95.1
ORG-TRL [Zhang_2020_CVPR] | 54.3 | 36.4 | 73.9 | 95.2
SwinBERT [abs-2111-13196] | 58.2 | 41.3 | 77.5 | 120.6
GIT (ours) | 79.5 | 51.1 | 87.3 | 180.2

(b) MSRVTT [xu2016msr]
Method | B@4 | M | R | C
STG-KD [pan2020spatio] | 40.5 | 28.3 | 60.9 | 47.1
Support-set [patrick2020support] | 38.9 | 28.2 | 59.8 | 48.6
PMI-CAP [chen2020learning] | 42.1 | 28.7 | - | 49.4
ORG-TRL [Zhang_2020_CVPR] | 43.6 | 28.8 | 62.1 | 50.9
OpenBook [zhang2021open] | 33.9 | 23.7 | 50.2 | 52.9
SwinBERT [abs-2111-13196] | 41.9 | 29.9 | 62.1 | 53.8
MV-GPT [abs-2201-08264] | 48.9 | 38.7 | 64.0 | 60
GIT (ours) | 53.8 | 32.9 | 67.7 | 73.9

(c) YouCook2 [zhou2018towards]
Method | B@4 | M | R | C
Vid.BERT [sun2019videobert] | 4.3 | 11.9 | - | 55.0
ActBERT [zhu2020actbert] | 5.4 | 13.3 | - | 65.0
SwinBERT [abs-2111-13196] | 9.0 | 15.6 | 37.3 | 109.0
Flamingo [alayrac2022flamingo] | - | - | - | 118.6
VALUE [VALUE] | 12.4 | 18.8 | 40.4 | 130.3
UniVL [Luo2020UniVL] | 17.4 | 22.4 | 46.5 | 181
MV-GPT [abs-2201-08264] | 21.9 | 27.1 | 49.4 | 221
GIT (ours) | 10.3 | 17.3 | 39.8 | 129.8

(d) VATEX [wang2019vatex] public test
Method | B@4 | R | M | C
VaTeX [wang2019vatex] | 28.4 | 47.0 | 21.7 | 45.1
OpenBook [zhang2021open] | 33.9 | 50.2 | 23.7 | 57.5
VALUE [VALUE] | - | - | - | 58.1
SwinBERT [abs-2111-13196] | 38.7 | 53.2 | 26.2 | 73.0
C.4Cap. [abs-2110-05204]†‡ | 40.6 | 54.5 | - | 85.7
GIT (ours) | 41.6 | 55.4 | 28.1 | 91.5

(e) VATEX [wang2019vatex] private test
Method | C
X-L.+T. [zhu2019vatex] | 81.4
Flamingo [alayrac2022flamingo] | 84.2
C.4Cap. [abs-2110-05204]†‡ | 86.5
GIT (ours) | 93.8

(f) TVC [lei2020tvr]
Method | B@4 | R | M | C
MMT [lei2020tvr] | 10.8 | 32.8 | 16.9 | 45.3
HERO [li2020hero] | 12.3 | 34.1 | 17.6 | 49.9
VALUE [VALUE] | 11.6 | 33.9 | 17.6 | 50.5
SwinBERT [abs-2111-13196] | 14.5 | 36.1 | 18.5 | 55.4
C.4Cap. [abs-2110-05204]†‡ | 15.0 | 36.9 | - | 66.0
GIT (ours) | 16.2 | 36.7 | 18.9 | 63.0

Table 8: Results on video captioning. †: model ensemble; ‡: with the subtitle as additional input. YouCook2 and TVC are on the validation set.
(a) MSVD-QA [xu2017video, chen-dolan-2011-collecting]
Method | Accuracy
QueST [JiangCLZG20] | 34.6
HCRN [LeLVT21] | 36.1
CoMVT [SeoNS21] | 42.6
JustAsk [YangMSLS21] | 46.3
VIOLET [abs-2111-12681] | 47.9
All-in-one [wang2022all] | 48.3
GIT (ours) | 56.8

(b) MSRVTT-QA [xu2017video, xu2016msr]
Method | Accuracy
JustAsk [YangMSLS21] | 41.5
MV-GPT [abs-2201-08264] | 41.7
MERLOT [ZellersLHYPCFC21] | 43.1
VIOLET [abs-2111-12681] | 43.9
All-in-one [wang2022all] | 46.8
Flamingo [alayrac2022flamingo] | 47.4

(c) TGIF-Frame [JangSYKK17]
Method | Accuracy
HCRN [LeLVT21] | 55.9
QueST [JiangCLZG20] | 59.7
ClipBERT [LeiLZGBB021] | 60.3
All-in-one [wang2022all] | 66.3
VIOLET [abs-2111-12681] | 68.9
MERLOT [ZellersLHYPCFC21] | 69.5
GIT (ours) | 72.8

Table 9: Results on video question answering. All are open-ended question answering tasks.

Video QA is evaluated on MSVD-QA [xu2017video, chen-dolan-2011-collecting], MSRVTT-QA [xu2017video, xu2016msr], and TGIF-Frame [JangSYKK17], which are all open-ended tasks. As shown in Table 9, our simple solution achieves new SOTA on MSVD-QA and TGIF-Frame with a large margin.

Vocabulary | Method | Top-1
Closed | ALIGN [JiaYXCPPLSLD21] | 88.64
Closed | Florence [abs-2111-11432] | 90.05
Closed | CoCa [yu2022coca] | 91.0
Open | GIT (ours) | 88.79
Table 10: Results on the ImageNet-1k classification task. Our approach takes the class name as the caption and predicts the label in an auto-regressive way without pre-defining the vocabulary.

Results on Image Classification

We fine-tune GIT on ImageNet-1k. Each category is mapped to a unique class name, and the prediction is correct only if it exactly matches the ground-truth label up to whitespace (i.e., pred.replace(' ', '') == gt.replace(' ', '')). As shown in Table 10, our approach achieves decent accuracy without pre-defining the vocabulary.
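The matching rule above amounts to a one-line check (the function name is ours, for illustration):

```python
def label_match(pred: str, gt: str) -> bool:
    """Generation-based classification is scored by exact match of the generated
    class name with the ground truth, ignoring whitespace differences."""
    return pred.replace(' ', '') == gt.replace(' ', '')
```

So a generated "tiger cat" matches a ground-truth "tigercat", but any other textual deviation counts as an error, which makes the 88.79% top-1 accuracy a strict measure.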

Method | Fine-tuning data | Regular text: IC13 [karatzas2013icdar], SVT [wang2011end], IIIT [mishra2012scene] | Irregular text: IC15 [karatzas2015icdar], SVTP [phan2013recognizing], CUTE [risnumawan2014robust] | Average
SAM [liao2019mask] | MJ+ST | 95.3 90.6 93.9 | 77.3 82.2 87.8 | 87.8
Ro.Scanner [yue2020robustscanner] | MJ+ST | 94.8 88.1 95.3 | 77.1 79.5 90.3 | 87.5
SRN [yu2020towards] | MJ+ST | 95.5 91.5 94.8 | 82.7 85.1 87.8 | 89.6
ABINet [fang2021read] | MJ+ST | 97.4 93.5 96.2 | 86.0 89.3 89.2 | 91.9
S-GTR [he2021visual] | MJ+ST | 96.8 94.1 95.8 | 84.6 87.9 92.3 | 91.9
GIT (ours) | TextCaps | 94.2 91.5 92.9 | 78.2 87.1 95.5 | 89.9
GIT (ours) | MJ+ST | 97.3 95.2 95.3 | 83.7 89.9 96.2 | 92.9
Table 11: Results on scene text recognition. MJ and ST indicate the MJSynth [jaderberg2014synthetic, jaderberg2016reading] and SynthText [gupta2016synthetic] datasets used for training scene text recognition models.

Results on Scene Text Recognition

The task [graves2006connectionist] aims to read scene text without the requirement to describe the context. We evaluate our model in two settings. The first uses GIT fine-tuned on TextCaps, which generates an open-ended caption; the model is the same as the one used for the TextCaps evaluation in Table 6. The prediction is considered correct if the caption contains the ground-truth scene text word. The second fine-tunes the model on two large scene text datasets, MJSynth (MJ) [jaderberg2014synthetic, jaderberg2016reading] and SynthText (ST) [gupta2016synthetic], where the ground-truth scene text is used as the caption. Here the prediction is correct only if the output exactly matches the ground truth, i.e., the standard word accuracy. Following the established setup, we evaluate on six standard benchmarks, as shown in Table 11. Our TextCaps-fine-tuned captioning model achieves an 89.9 average accuracy, close to the SOTA (91.9), which demonstrates the strong scene text comprehension capability of our captioning model. After fine-tuning on the standard MJ+ST datasets, GIT achieves 92.9, surpassing the prior arts [fang2021read, he2021visual] at 91.9.
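The first evaluation setting (correct if the caption contains the ground-truth word) can be sketched as follows. Whether matching is case-insensitive and token-based is an assumption of this sketch; the text does not specify it:

```python
def caption_contains_word(caption: str, gt_word: str) -> bool:
    """TextCaps-style scene text evaluation: the prediction counts as correct if
    the generated caption contains the ground-truth scene-text word.

    Case-insensitive whole-token matching is one reasonable choice (an assumption
    here); punctuation handling would need further care in a real evaluator.
    """
    return gt_word.lower() in caption.lower().split()
```

The second setting is simply exact string equality between the generated output and the ground-truth word.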


Model and data scaling. To study the trend with respect to data scale, we construct two smaller pre-training datasets: one combines COCO, SBU, CC3M, and VG, leading to 4M images or 10M image-text pairs; the other further adds CC12M, leading to about 14M images or 20M image-text pairs. When pre-training on the small-scale datasets, we use 30 epochs rather than the 2 epochs used on the 0.8B data. For the network structure, we name our default model Huge, and we replace its image encoder with ViT-B/16 and ViT-L/14 from CLIP [clip] to form the Base and Large variants, respectively. Fig. 3 shows the results on COCO, TextCaps, and VizWiz-QA. On COCO, the Base model benefits from scaling 4M to 14M, but the performance drops with 0.8B data. The 14M data are more similar to COCO than the majority of the noisy 0.8B data; meanwhile, the Base model with limited capacity may not be able to benefit effectively from large-scale data. Similar observations are reported in [kolesnikov2020big] for ImageNet-1k classification. On TextCaps and VizWiz-QA, all model variants benefit significantly from more pre-training data, and a larger backbone improves more, especially with 0.8B data.

Figure 3: Performance with different pre-training data scales and different model sizes. (a) COCO; (b) TextCaps; (c) VizWiz-QA.

Scene text in pre-training data. To understand the capability of scene text comprehension, we examine the pre-training dataset and study how many image-text pairs contain scene text. We first run the Microsoft Azure OCR API against all images in CC12M and 500K images from the web-crawled data. The OCR result is compared with the associated text; a pair is considered matched only if the text contains an OCR result longer than 5 characters. We estimate that 15% of CC12M and 31% of the downloaded images contain scene text descriptions. As the training task is to predict the text, the network gradually learns to read the scene text.
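The matching heuristic from this analysis can be written down directly. Treating "contains" as substring containment is an assumption of this sketch:

```python
def has_scene_text_description(caption: str, ocr_results: list) -> bool:
    """Count an image-text pair as containing a scene-text description if the
    associated text contains an OCR result longer than 5 characters.

    Substring containment (rather than token match) is an assumption here.
    """
    return any(len(word) > 5 and word in caption for word in ocr_results)
```

Applying such a check over CC12M and the web-crawled sample is what yields the 15% and 31% estimates quoted above.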


Conclusion

In this paper, we design and train a simple generative model, named GIT, to map the input image to the associated text description on large-scale image-text pairs. On image/video captioning and question answering tasks, our model achieves new state-of-the-art performance across 12 benchmarks and surpasses human performance on TextCaps for the first time. For image classification, we apply the generation task to predict the label name directly. This strategy differs from existing work with a pre-defined and fixed vocabulary, and is beneficial especially when data for new categories are added.

Limitations. We focus on the pre-training-and-fine-tuning strategy to improve absolute performance. Empirically, we find it is not easy to control the generated caption, and the model lacks zero-shot and few-shot capabilities, which we leave as future work.

Societal impact. Compared with existing work, our model clearly improves performance and is more appropriate for helping visually impaired people. However, the model is pre-trained on large-scale data that are not guaranteed to be free of toxic language, which may poison the output. Although we observe few such instances qualitatively, special care should be taken when deploying the model in practice, and more research is required to control the output.


Acknowledgments

We would like to thank Lin Liang for her help with the scene text recognition experiments, Houdong Hu and Min Gao for their help with the pre-training data, and Lu Yuan and Bin Xiao for their help with the pre-trained image encoder. We also thank Nguyen Bach, Jiayuan Huang, Luis Vargas, Yumao Lu, Michael Zeng, and Xuedong Huang for their support.