mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

by Xiaotong Li, et al.
Peking University

Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning. A seminal work, BEiT, casts MIM as a classification task with a visual vocabulary, tokenizing the continuous visual signals into discrete vision tokens using a pre-learned dVAE. Despite being a feasible solution, the improper discretization hinders further improvements of image pre-training. Since image discretization has no ground-truth answers, we believe that the masked patch should not be assigned a unique token id even if a better tokenizer can be obtained. In this work, we introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives. Specifically, the multi-choice supervision for the masked image patches is formed by the soft probability vectors of the discrete token ids, which are predicted by the off-the-shelf image tokenizer and further refined by high-level inter-patch perceptions, resorting to the observation that similar patches should share their choices. Extensive experiments on classification, segmentation, and detection tasks demonstrate the superiority of our method: e.g., the pre-trained ViT-B achieves 84.1% top-1 accuracy on ImageNet-1K classification, 51.2% mIOU on ADE20K semantic segmentation, and 51.2% box AP on COCO object detection, outperforming the competitive counterparts.





1 Introduction

Self-supervised pre-training beit; dino; simclr; simclr_v2; mocov3; ibot is attracting increasing attention due to its effectiveness and flexibility in exploiting large-scale uncurated data, and it has demonstrated its superiority to supervised pre-training in a wide range of downstream applications, such as classification, detection, and segmentation. Recently, the introduction of vision Transformers vit; deit; swinT has brought about a new revolution to self-supervised learning beit; mocov3; dino.

Inspired by the great success of BERT BERT in natural language processing (NLP) tasks, masked image modeling (MIM) has been introduced as a new pretext task for visual pre-training. This is non-trivial, because one key barrier lies in that the visual signal is continuous and cannot be properly classified as is done in masked language modeling (MLM) of BERT. A pioneering work, BEiT beit, tackles the challenge by “tokenizing” continuous visual signals into discrete vision tokens resorting to a pre-learned codebook dalle, which plays the role of a pre-defined vocabulary in MLM. The pre-training objective is to predict the vision token id of the masked image patch based on its context and semantics.

Figure 1: Improper token ids for image discretization, where a better tokenizer taming is used. We observe that semantically-similar patches might be allocated different token ids while patches with different semantics might be allocated the same token id, indicating that the hard-label classification with unique token ids in BEiT beit may hinder the pre-training performance.

Despite the impressive performance of BEiT on image pre-training, some questions remain underexplored. (1) Does the masked patch prediction have ground-truth answers? Unlike the linguistic vocabulary, which is naturally composed of discrete words, the image tokenizer is relatively subjective, i.e., there is no perfect answer to visual discretization, and the tokenizer carries inevitable label noise even if a better tokenizer taming is used. For example, as shown in Fig. 1, patches of the dog and the shoe are discretized into the same vision token (#319) due to their similar pixel-level representations. (2) Should the masked patch be assigned a unique token id given a pre-learned tokenizer? Not really. As illustrated in Fig. 1, semantically-similar patches of the grass are discretized into different vision tokens, i.e., they are classified into distinct and unique ids in BEiT pre-training, neglecting their semantic relations.

Given these two observations, we argue that performing MIM with a strict mapping between patch predictions and unique token ids, in the form of a hard-label classification loss as in BEiT, limits visual context capturing and the pre-training performance. To tackle the challenge, we propose to boost BERT-style image pre-training with eased and refined masked prediction targets, that is, multi-choice vision token ids. Rather than retraining the tokenizer with perceptual regularizations peco; ibot, we efficiently inject semantic relations into off-the-shelf vision tokens without any extra computational overhead.

Specifically, to enable multi-choice answers for masked patches, we adopt the soft id probability vectors, rather than the hard predicted id over a pre-learned codebook, as the supervision signals for masked image modeling. Although the off-the-shelf image tokenizer taming can capture some local semantics thanks to its training objectives of both pixel-level and perceptually-aware regularizations, it proves to be still vulnerable to various low-level changes (see Fig. 1). Therefore, we propose to refine the predicted soft id probabilities by inter-patch semantic similarities, which are estimated by the vision Transformer being trained. Under the observation that patches with similar high-level visual perceptions ought to share their predictions, we propagate the soft id probabilities of different patches in an image based on their semantic similarities and form ensembled learning targets for masked image patches (see Fig. 2). The final training objective is formulated as a soft-label cross-entropy loss.

To fully evaluate our novel, flexible, and effective method, we pre-train vision Transformers of various scales on the widely-acknowledged ImageNet-1K imagenet1k dataset and fine-tune the pre-trained models on multiple downstream tasks, including image classification, instance/semantic segmentation, and object detection. The empirical results show that our method impressively outperforms supervised pre-training as well as recent self-supervised learning methods mocov3; dino; beit; ibot. Concretely, we achieve 84.1% top-1 accuracy on ImageNet-1K classification with a ViT-B model, outperforming the state-of-the-art iBOT ibot by +0.3% with 800 fewer pre-training epochs. Regarding transfer learning on different downstream tasks, our pre-trained ViT-B model achieves 51.2% mIOU on ADE20K ade20k semantic segmentation, and 51.2% box AP and 44.3% mask AP for object detection and instance segmentation on COCO coco, outperforming all existing methods.

2 Related Works

Self-supervised learning (SSL) has gained great popularity owing to its capability of exploiting tremendous amounts of unlabeled data by leveraging the input data itself as supervision. Substantial works moco; simclr; beit; BERT; GPT3; dino have shown that such pre-training benefits downstream tasks and enables faster training convergence, showing impressive potential on various machine learning tasks, especially in the fields of natural language processing and computer vision.

2.1 BERT pre-training with masked language modeling

Self-supervised learning has been studied in NLP for decades. Masked language modeling (MLM) first gained wide recognition through BERT BERT. BERT encourages bidirectional textual context understanding and adopts masked language modeling for pre-training, randomly masking 15% of the tokens and predicting the missing words as the target. Since then, various related pre-training variants have been proposed, e.g., GPT GPT3, XLM xlm, and RoBERTa roberta. These works have achieved huge success and shown impressive performance on various downstream tasks, greatly advancing the development of language pre-training.

2.2 Self-supervised visual pre-training

In the past few years, various pretext tasks have been designed for self-supervised visual pre-training. For example, earlier pretext-based works adopt pseudo labels based on the attributes of images to learn representations, such as image colorization color, jigsaw puzzle jigsaw, and rotation prediction rotation. Besides these approaches, there are two mainstream paradigms, i.e., contrastive learning and masked image modeling, which are analyzed in the following subsections.

Contrastive learning: Contrastive learning is an instance-level discriminative approach that has occupied a dominant position in visual pre-training. Contrastive methods, such as SimCLR simclr; simclr_v2, MoCo moco; mocov3, and SwAV swav, typically rely on data augmentation to create counterparts of the images and aim at learning an embedding space where similar sample pairs are close to each other while dissimilar ones are far apart. SwAV swav proposes a cluster-based contrastive learning method to enforce consistency between cluster assignments under different augmentations. BYOL byol abandons negative samples and avoids collapse with an additional online network. MoCo v3 mocov3 extends the contrastive learning framework to Transformers and further promotes the development of self-supervised vision Transformers.

Masked Image Modeling: Motivated by the great success of BERT, masked image modeling (MIM) beit; ibot; mae has become a new trend in self-supervised visual pre-training, which randomly masks parts of an image and reconstructs them based on the corrupted image. ViT vit attempts to adopt masked patch prediction for self-supervised learning. BEiT beit predicts the discrete tokens of masked patches resorting to an off-the-shelf discrete VAE. Instead of discretizing the visual information, MAE mae and SimMIM simmim propose to directly predict pixel-level values as the reconstruction target. MaskFeat maskfeat further exploits different supervision signals, such as HOG features, as the objective. iBOT ibot performs masked prediction and adopts the teacher network as an online tokenizer to provide the supervision. PeCo peco further provides evidence that a perceptually-aware tokenizer yields better pre-training performance for masked image modeling.

3 Preliminaries

3.1 Image BERT Pre-training with Masked Image Modeling

The paradigm of mask-and-then-predict was first introduced in BERT pre-training BERT for NLP tasks to encourage bidirectional context understanding of textual signals. Recent works beit; ibot reproduce the success of BERT by employing the proxy task of masked image modeling (MIM) for image pre-training of vision Transformers vit; deit; swinT. MIM randomly masks a proportion of the image patches and then trains the vision Transformer to recover the corrupted image via reasoning among the visual context. The pretext task of MIM enables a more fine-grained understanding of local visual semantics compared to the contrastive counterparts simclr; moco. Vision Transformers pre-trained with MIM objectives can be well transferred to a wide range of downstream tasks, e.g., classification, segmentation, and detection, after fine-tuning.

3.2 Masked Image Modeling as Single-choice Classification

Introducing the mask-and-then-predict paradigm into image pre-training is actually non-trivial, because the visual signals are continuous and cannot be predicted against a well-defined vocabulary. A pioneering work, BEiT beit, tackles the challenge by casting masked patch prediction as a single-choice classification problem, discretizing the image into vision tokens with an off-the-shelf “tokenizer”. The “tokenizer” can be a discrete auto-encoder dalle; taming pre-learned towards a reconstruction objective.

Formally, given a raw image $x$, it is initially divided into $N$ patches and then mapped into compact patch embeddings. We denote the corrupted image as $\hat{x}$, which is formed by masking part of the patches in $x$, and we denote the set of masked patch indices as $\mathcal{M}$. We encode the image patch features with high-level perceptions by feeding $\hat{x}$ into the vision Transformer. The patch features are further projected to the probabilities of the vision token ids using an MLP head, which will be dropped for downstream tasks. We denote the probability vectors as $p \in \mathbb{R}^{N \times |\mathcal{V}|}$, where $|\mathcal{V}|$ is the length of the visual vocabulary defined by the pre-learned image “tokenizer”.

To receive the answers for masked image modeling, we discretize the raw image into vision tokens using the image “tokenizer”, where each patch receives token logits $z_i \in \mathbb{R}^{|\mathcal{V}|}$ over the visual vocabulary. The token id with the maximal probability for patch $i$ is termed $\hat{y}_i$. The pre-training objective is formulated as a hard-label cross-entropy loss to encourage masked patch prediction with unique token ids as follows,

$\mathcal{L}_{\mathrm{MIM}} = -\sum_{i \in \mathcal{M}} \log p_i(\hat{y}_i \mid \hat{x}).$   (1)
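As a concrete illustration, the single-choice objective above can be sketched in a few lines of NumPy (a minimal sketch under assumed names and shapes, not the authors' implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hard_label_mim_loss(logits, token_ids, masked):
    """Single-choice MIM loss: cross-entropy against a unique token id.

    logits:    (N, V) patch predictions from the Transformer MLP head
    token_ids: (N,)   hard token ids assigned by the tokenizer
    masked:    list of masked patch indices
    """
    p = softmax(logits)
    # only masked patches contribute to the objective
    return -np.mean([np.log(p[i, token_ids[i]]) for i in masked])

# toy usage: 4 patches, a vocabulary of 8 ids, patches 1 and 3 masked
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
ids = np.array([3, 5, 0, 7])
loss = hard_label_mim_loss(logits, ids, masked=[1, 3])
assert loss > 0
```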
4 mc-BEiT

BEiT provides the inspiring insight of casting masked image modeling (MIM) as a classification problem to bridge the gap between discrete words in NLP tasks and continuous visual signals in computer vision. However, as there are no perfect answers for visual discretization, performing a strict mapping between patch predictions and unique token ids as a single-choice classification problem is a sub-optimal solution for MIM pre-training. As illustrated in Fig. 1, there may exist multiple appropriate token ids for a certain patch, motivating us to boost BEiT pre-training with multi-choice classification.

4.1 Masked Image Modeling as Multi-choice Classification

Figure 2: The overview of the proposed method, mc-BEiT. We improve image BERT pre-training with multi-choice training objectives, composed of soft probability vectors predicted by the off-the-shelf image “tokenizer” and further refined by high-level inter-patch perceptions. A proportion of image patches are randomly masked and then fed into the vision Transformer. The masked patch prediction is optimized towards eased and refined multi-choice token ids in the form of a soft-label cross-entropy loss.

We introduce an improved BERT-style image pre-training with eased and refined masked prediction targets, i.e., multi-choice vision token ids, rather than a unique answer. All possible token ids in the visual vocabulary are assigned possibilities to be chosen. To this end, we soften the training objective from the original hard-label cross-entropy loss to a soft-label cross-entropy loss with the multi-choice targets $\tilde{y}$ as follows,

$\mathcal{L}_{\mathrm{MIM}} = -\sum_{i \in \mathcal{M}} \sum_{k=1}^{|\mathcal{V}|} \tilde{y}_{i,k} \log p_i(k \mid \hat{x}),$   (2)

where $\tilde{y}_{i,k} \geq 0$ and $\sum_{k} \tilde{y}_{i,k} = 1$. We will go over how to produce such refined multi-choice answers for MIM pre-training in the following section.
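The soft-label cross-entropy above can be sketched as follows (a minimal NumPy illustration under assumed names and shapes, not the authors' code):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_label_mim_loss(logits, targets, masked):
    """Multi-choice MIM loss: soft-label cross-entropy.

    logits:  (N, V) patch predictions from the Transformer head
    targets: (N, V) multi-choice targets; each row is a distribution
    masked:  list of masked patch indices
    """
    logp = np.log(softmax(logits))
    return -np.mean([(targets[i] * logp[i]).sum() for i in masked])

# sanity check: with uniform logits and uniform targets over V = 8 ids,
# the loss equals log(8); a one-hot target row recovers the hard-label loss
logits = np.zeros((2, 8))
uniform = np.full((2, 8), 1 / 8)
loss = soft_label_mim_loss(logits, uniform, masked=[0, 1])
assert abs(loss - np.log(8)) < 1e-9
```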

4.2 Multi-choice Visual Discretization

To produce multi-choice discretization without extra training stages or computational overhead, we attempt to exploit the predictions from the off-the-shelf image tokenizer. Given the token logits $z_i$ from the image tokenizer, we estimate soft probabilities rather than using the unique predicted token id as done in the single-choice version. Specifically, the soft probability vector $\hat{p}_i$ is obtained using a softmax operation, where a temperature coefficient $\tau$ is used to move between sharpness (single-choice) and smoothness (multi-choice),

$\hat{p}_i = \operatorname{softmax}(z_i / \tau).$   (3)
As discussed in the introduction section and illustrated in Fig. 1, semantically-similar patches may be allocated with discrepant token ids and semantically-dissimilar patches may be allocated with the same token id due to their low-level similarities, indicating that the raw predictions from the off-the-shelf tokenizer are sub-optimal for fully representing the semantic relations among patches. This phenomenon motivates us to refine the predictions of the tokenizer with inter-patch relations, which can be estimated from the high-level perceptions encoded by the in-training vision Transformer. Specifically, we calculate the cosine similarity between patch features to measure their affinities,

$A_{ij} = \frac{\langle f_i, f_j \rangle}{\| f_i \| \, \| f_j \|},$   (4)

where $f_i$ denotes the feature of patch $i$ and $\langle \cdot, \cdot \rangle$ indicates the inner product between two feature vectors, i.e., the cosine similarity equals the inner product after normalization. Based on the observation that perceptually-similar patches ought to share their choices, we propagate the soft probabilities of different patches in an image, weighted by the row-normalized affinities, to form a refined target $\sum_j A_{ij} \hat{p}_j$. In this way, patches with similar high-level perceptions can provide complementary supervision signals for the masked patches.

The overall objective of multi-choice image discretization is composed of the weighted sum of the aforementioned parts, where a semantic equilibrium coefficient $\omega$ is introduced to move between low-level semantics (directly predicted by the tokenizer) and high-level semantics (ensembled from the perceptually-similar patches). The former adopts the eased supervision directly predicted by the tokenizer, while the latter injects high-level perceptions by propagating among other semantically-similar patches, together forming the refined multi-choice targets as follows:

$\tilde{y}_i = \omega \, \hat{p}_i + (1 - \omega) \sum_{j} A_{ij} \, \hat{p}_j,$   (5)

which is further used as the objective for masked patch predictions in Eq. (2).
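Putting the steps of this section together, the construction of the refined multi-choice targets can be sketched as follows (a hypothetical NumPy sketch; the function name, shapes, and the softmax-normalized affinity rows are our assumptions, not the released implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_choice_targets(tok_logits, feats, temp=4.0, omega=0.8):
    """Build refined multi-choice targets from tokenizer logits.

    tok_logits: (N, V) logits from the off-the-shelf tokenizer
    feats:      (N, D) patch features from the in-training Transformer
    temp:       temperature, moving between sharp and smooth targets
    omega:      semantic equilibrium coefficient
    """
    # eased soft probabilities from the tokenizer (temperature softmax)
    p_soft = softmax(tok_logits / temp)
    # inter-patch affinities from cosine similarity, row-normalized
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    A = softmax(f @ f.T)
    # mix low-level (tokenizer) and high-level (propagated) semantics
    return omega * p_soft + (1.0 - omega) * A @ p_soft

# toy usage: each target row stays a valid probability distribution
targets = multi_choice_targets(np.zeros((4, 8)), np.eye(4))
assert np.allclose(targets.sum(axis=1), 1.0)
```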

5 Experiments

5.1 Pre-Training Setup

In our experiments, the images of 224×224 resolution are divided into 14×14 patch sequences with 16×16 patch size. We use different architectures such as ViT-Base/16 and ViT-Large/16 for pre-training and the backbone implementation follows vit for fair comparisons. For the BERT-style visual pre-training, we randomly mask 75% of the patches for masked image modeling. Inspired by PeCo peco, we employ the off-the-shelf VQGAN of taming as a better tokenizer, which is pre-trained on OpenImages OpenImages2 with a vocabulary size of 8192. In our experiments, the semantic equilibrium coefficient is 0.8 and the temperature coefficient is 4.0 by default. The vision Transformers are pre-trained for 800 epochs on the widely-acknowledged ImageNet-1K imagenet1k dataset, which includes 1.28 million images. Note that the ground-truth labels are disabled for pre-training. We use 16 Nvidia A100 GPUs for pre-training and a batch size of 128 per GPU. We adopt simple image augmentation for pre-training, including random resized cropping and horizontal flipping. The detailed recipe of pre-training and fine-tuning is summarized in the supplementary materials.
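The random-masking step of this setup can be sketched as follows (a hypothetical helper, not the authors' code; the 14×14 = 196 patch grid follows the setup above):

```python
import random

def random_mask(num_patches=196, ratio=0.75, seed=None):
    """Randomly select a fraction of patch indices to mask,
    mirroring the 75% random masking used for pre-training."""
    rng = random.Random(seed)
    n_mask = int(num_patches * ratio)
    return sorted(rng.sample(range(num_patches), n_mask))

masked = random_mask(seed=0)
assert len(masked) == 147  # 75% of the 196 patches of a 14x14 grid
```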

Method Reference Pre-train Epoch Acc (%)
Supervised Pre-training (training from scratch):
ViT-B/16 vit ICLR 2021 - 77.9
ViT-L/16 vit ICLR 2021 - 76.5
DeiT-B/16 deit ICML 2021 - 81.8
Self-supervised Pre-training using ViT-B/16:
MoCo v3 mocov3 CVPR 2021 300 83.2
DINO dino ICCV 2021 300 82.8
BEiT beit ICLR 2022 800 83.2
iBOT ibot ICLR 2022 1600 83.8
Ours this paper 800 84.1
Self-supervised Pre-training using ViT-L/16:
MoCo v3 mocov3 CVPR 2021 300 84.1
BEiT beit ICLR 2022 800 85.2
Ours this paper 800 85.6
Table 1: The top-1 fine-tuning accuracy of ImageNet-1K using ViT-B and ViT-Large with different pre-training methods.

5.2 Image Classification

For the ImageNet classification task, the fully-connected layer is employed as the classifier after the average pooling of the feature embeddings. We adopt top-1 accuracy after fine-tuning as the evaluation metric and thoroughly compare our method with supervised methods, i.e., ViT vit and DeiT deit, and recently published state-of-the-art self-supervised learning methods, i.e., MoCo v3 mocov3, DINO dino, BEiT beit, and iBOT ibot. The experiment results are listed in Tab. 1. The proposed method obtains 84.1% and 85.6% top-1 accuracy with ViT-B and ViT-L respectively, outperforming the competing methods and achieving state-of-the-art performance. Our mc-BEiT shows significant gains over the baseline BEiT, which verifies the effectiveness of the introduced multi-choice objectives. Concretely, our method outperforms the recent state-of-the-art method iBOT ibot by +0.3% with half the pre-training epochs (800 vs. 1600). It is noted that iBOT adopts an extra teacher network and multi-crop augmentation for pre-training, showing lower efficiency than our method.

Different training epochs and architectures: We also provide more comprehensive results for different training epochs and architectures in Tab. 2. From the table, we can see that our method adapts well to different scales of vision Transformers, e.g., the widely used ViT-B and ViT-L. It is worth noting that our method obtains relatively high accuracy (already state-of-the-art) when pre-training for only 300 epochs. Moreover, the performance can be further improved with longer pre-training, e.g., the accuracy reaches 84.1% when pre-training for 800 epochs.

Method Arch. Model Size Pre-train Epoch Acc (%)
Self-supervised Pre-training using ViT-B/16:
Ours ViT-B 86M 100 83.3
Ours ViT-B 86M 300 83.8
Ours ViT-B 86M 800 84.1
Self-supervised Pre-training using ViT-L/16:
Ours ViT-L 307M 300 85.2
Ours ViT-L 307M 800 85.6
Table 2: The top-1 fine-tuning accuracy of ImageNet-1K using our mc-BEiT with different training epochs and backbone architectures.

Convergence curve: In Fig. 3, we further demonstrate the convergence curves of the supervised learning method and self-supervised learning methods, i.e., the baseline BEiT and our method, when fine-tuning ViT-B models. As shown in the figure, the proposed method achieves faster convergence and better performance than training DeiT from scratch deit. Meanwhile, our method obtains obvious and consistent performance gains over the baseline BEiT, showing the superiority of the proposed multi-choice training objectives.

Figure 3: The convergence curves when fine-tuning ViT-B models on ImageNet-1K classification. The models are pre-trained by different methods.

5.3 Semantic Segmentation

Semantic segmentation is a pixel-level classification task and is often adopted to evaluate pre-training performance on downstream tasks. Here we evaluate on the ADE20K ade20k benchmark, adopting mean intersection over union (mIOU) averaged over all semantic categories as the evaluation metric. Following the common setting in beit; ibot, ViT-B is adopted as the default backbone and UPerNet upernet is used as the semantic segmentation task head.

Method Reference mIOU
Supervised deit ICML 2021 45.3
MoCo v3 mocov3 CVPR 2021 47.2
DINO dino ICCV 2021 46.8
iBOT ibot ICLR 2022 50.0
BEiT beit ICLR 2022 45.6
Ours this paper 50.2
+Intermediate Fine-tuning
BEiT beit ICLR 2022 47.7
Ours this paper 51.2
Table 3: Results of semantic segmentation on ADE20K. Intermediate fine-tuning denotes the pre-trained model has been fine-tuned on ImageNet-1K classification. The vision Transformer backbone is ViT-B and the task head is UPerNet.

Tab. 3 shows that our method significantly improves the transferability of pre-trained models compared to supervised learning, with a +4.9% performance gain. It is also noticed that our method outperforms recent state-of-the-art self-supervised methods. Because the pre-training process does not introduce instance discrimination, the performance can be further improved by intermediate fine-tuning on ImageNet-1K, following BEiT beit; therefore we also compare the performance after intermediate fine-tuning. It achieves a better result of 51.2% mIOU, a +1.0% gain over its pre-training-only version.

5.4 Object Detection and Instance Segmentation

For the object detection and instance segmentation tasks, the COCO coco benchmark is employed to validate the pre-training performance. Here, we use absolute position embeddings and interpolate them to adapt to the multi-scale strategy. ViT-B is adopted as the backbone and Cascade Mask R-CNN cascade; maskrcnn is used as the task head. The evaluation metrics for object detection and instance segmentation are bounding box AP and mask AP, respectively.

Method Reference Object Det. Instance Seg.
Supervised deit ICML 2021 47.9 42.9
MoCo v3 mocov3 CVPR 2021 47.9 42.7
DINO dino ICCV 2021 50.1 43.4
iBOT ibot ICLR 2022 51.2 44.2
BEiT beit ICLR 2022 49.6 42.8
Ours this paper 50.1 43.1
+Intermediate Fine-tuning
BEiT beit ICLR 2022 50.7 43.8
Ours this paper 51.2 44.3
Table 4: Experiment results of object detection and instance segmentation on COCO. Intermediate fine-tuning denotes the model is further fine-tuned on ImageNet-1K. Cascade Mask R-CNN and the 1× training schedule are adopted. As experiments on COCO are not conducted in BEiT, the results are based on our re-implementation.

As observed in Tab. 4, BERT-style pre-training shows superiority to supervised pre-training in terms of performance. Meanwhile, the proposed method outperforms the competitor BEiT with +0.5%/+0.3% gains in box AP and mask AP. We also evaluate the performance after intermediate fine-tuning; the relative improvement remains obvious and our method achieves comparable or better performance than state-of-the-art methods.

6 Ablation Study

In this section, we conduct an extensive ablation study of our method on ImageNet-1K. Considering the time expenditure, all ablation experiments are performed with ViT-B/16 pre-trained for 100 epochs on ImageNet-1K.

6.1 The temperature coefficient

The temperature coefficient scales the logits from the tokenizer, moving between sharpness (single-choice) and smoothness (multi-choice). We adopt common values for the temperature to ablate its effect. In general, a small temperature sharpens the probability distribution while a large one smooths it; when the temperature is extremely small, the task approximates single-choice classification. The ablation is shown in Tab. 5, where the single-choice label indicates training with the strict mapping to the unique answer. From the results, we observe that multi-choice vision tokens improve the BERT-style pre-training performance, and the method behaves best when setting the temperature to 4.0 empirically.

Objective Single-choice Multi-choice (temperature = 0.04 / 1.0 / 4.0 / 10.0)
Acc (%) 83.0 83.1 / 83.1 / 83.3 / 83.2
Table 5: Ablation study on the hyper-parameter of the temperature coefficient.
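The role of the temperature can be illustrated with a toy example (a sketch, not the paper's code; the logits are made up for illustration):

```python
import math

def softmax_t(logits, temp):
    """Temperature-scaled softmax: a small temperature yields a near
    one-hot distribution (single-choice), a large one a near-uniform
    distribution (multi-choice)."""
    z = [l / temp for l in logits]
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

logits = [2.0, 1.0, 0.0]
sharp = softmax_t(logits, 0.04)   # approximately single-choice
smooth = softmax_t(logits, 10.0)  # eased multi-choice
assert sharp[0] > 0.999
assert max(smooth) - min(smooth) < 0.1
```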

6.2 The semantic equilibrium coefficient

The semantic equilibrium coefficient is introduced to move between low-level semantics (directly predicted by the tokenizer) and high-level semantics (ensembled from the perceptually-similar patches). The ablation study is shown in Fig. 4. When the coefficient is set to 0, the objective relies entirely on the inter-patch-relation-guided term and achieves only 81.8% accuracy; the inevitable noise of the patch-similarity estimates, especially in the early epochs, causes collapse and degrades the pre-training performance. As the coefficient grows larger, it shows consistent gains over the baseline. When the coefficient is set to 1.0, the objective comes only from the low-level signals of the tokenizer, and the performance is still higher than the baseline, which shows the superiority of multi-choice over single-choice supervision. Based on these results, the semantic equilibrium coefficient is set to 0.8 for the best performance.

Figure 4: Ablation study on the trade-off parameter, i.e., the semantic equilibrium coefficient.

6.3 Masking strategy

In the masked image modeling approach, the masking strategy determines the difficulty of inferring the missing patches. Tab. 6 shows the influence of different masking strategies, where Block and Random masking types with different masking ratios are compared. The results show that the random masking strategy with a 75% masking ratio yields the best performance, which is thus adopted as the default setting for pre-training.

Masking Strategy Masking Ratio Acc (%)
Block 45 % 83.2
Block 60 % 83.2
Block 75 % 82.8
Random 45 % 83.0
Random 60 % 83.1
Random 75 % 83.3
Random 90 % 83.0
Table 6: Ablation study on different masking strategies.

6.4 Tokenizer

In BERT-style visual pre-training, the tokenizer plays the role of the textual vocabulary and is used to produce discrete vision tokens as supervision. As discussed in PeCo peco, a perceptually-aware tokenizer may benefit image BERT pre-training, so we adopt the off-the-shelf VQGAN taming as a better tokenizer throughout our experiments. Besides, we also verify the effectiveness of our multi-choice objectives on top of the vanilla BEiT tokenizer.

Tokenizer Training Data (Source, Scale) BEiT Acc (100 / 800 ep) Ours Acc (100 / 800 ep)
DALL-E dalle Private, 250M 82.3 / 83.2 82.6 / 83.7
VQGAN taming OpenImages, 9M 82.9 / 83.8 83.3 / 84.1
Table 7: Ablation study on different tokenizers. Top-1 fine-tuning accuracy (%) on ImageNet-1K is reported for 100 / 800 pre-training epochs.

The influence of tokenizers is shown in Tab. 7. Adopting the VQGAN as the tokenizer brings better performance than DALL-E, which verifies our observation that a tokenizer with higher-level semantics can indeed improve the pre-training performance, and indicates that enhancing the semantic relations is beneficial to visual pre-training. Meanwhile, the relative improvement of our method is consistent regardless of the tokenizer, which demonstrates the effectiveness of the proposed method.

6.5 Visualization

Besides the quantitative experiment results, we further provide some visualizations in Fig. 5 to better understand the effects of our multi-choice answers. 75% of the patches of the images are randomly masked for prediction. As observed from the blue box in Fig. 5(a), adjacent patches with similar semantics are still allocated different vision token ids, indicating that the hard vision token id directly from the tokenizer neglects the semantic relations and is a sub-optimal objective. In contrast, the proposed eased and refined objective can provide diverse possible vision tokens for the prediction. As shown in our multi-choice token signals, the semantically-similar patches have the possibility to be allocated the same vision token, which refines the objective with inter-patch perceptions. Furthermore, we randomly select a masked patch and show the inter-patch perception relations (obtained from the patch feature similarity) learned by the pre-trained model in Fig. 5(c). Taking the patch located at the edge of the car for illustration, the similar patches can still be well estimated even under 75% random masking, and the inter-patch relation shows higher responses to the skeleton of the car. It demonstrates that the informative semantic relations estimated by the in-training vision Transformer can properly enhance the multi-choice discretization for pre-training.

Figure 5: The visualization is obtained using the off-the-shelf tokenizer and our pre-trained vision Transformer. (a): the original image. (b): the corrupted image under 75% random masking. (c): the inter-patch perception relation, equipped with contour lines for better visual effect.

7 Conclusion

In this paper, we propose mc-BEiT, i.e., multi-choice discretization for improving image BERT pre-training. Instead of adopting the unique label signals from the tokenizer, we introduce an eased and refined objective that provides multi-choice answers. Extensive experiments are conducted to evaluate the performance of our method. The empirical results show that mc-BEiT achieves state-of-the-art performance on various tasks, such as image classification, semantic/instance segmentation, and object detection.



Appendix B Implementation Details

In the appendix, we provide the specific hyper-parameters of the experiments in our paper, including pre-training on ImageNet-1K and fine-tuning on different downstream tasks.

B.1 Configuration for pre-training

The vision Transformers are pre-trained on the large-scale dataset ImageNet-1K [imagenet1k] and the configurations are summarized in Tab. 8. The implementation of the vision Transformers, i.e., ViT-Base/16 and ViT-Large/16, follows [deit] for fair comparisons and the training recipe is based on BEiT [beit].

Configuration                     ViT-Base/16            ViT-Large/16
Layers                            12                     24
Hidden size                       768                    1024
FFN inner hidden size             3072                   4096
Attention heads                   12                     16
Attention head size               64                     64
Patch size                        16×16                  16×16
Training epochs                   800                    800
Batch size                        2048                   2048
Adam ε                            1e-8                   1e-8
Adam β                            (0.9, 0.98)            (0.9, 0.98)
Peak learning rate                1.5e-3                 1.5e-3
Minimal learning rate             1e-5                   1e-5
Learning rate schedule            Cosine                 Cosine
Warmup epochs                     10                     10
Gradient clipping                 3.0                    1.0
Dropout                           None                   None
Stoch. depth                      0.1                    0.1
Weight decay                      0.05                   0.05
Data augmentation                 RandomResizeAndCrop    RandomResizeAndCrop
Input resolution
Table 8: Configurations for pre-training.
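The learning-rate schedule in Tab. 8 (10 warmup epochs, then cosine decay from the 1.5e-3 peak to the 1e-5 minimum over 800 epochs) can be sketched per epoch. This is a sketch under the stated hyper-parameters; BEiT-style recipes typically update the rate per step rather than per epoch.

```python
import math

# Learning-rate schedule per Tab. 8: linear warmup for 10 epochs to the
# 1.5e-3 peak, then cosine decay to the 1e-5 minimum over 800 epochs.
def lr_at_epoch(epoch, peak=1.5e-3, minimum=1e-5, warmup=10, total=800):
    if epoch < warmup:
        return peak * (epoch + 1) / warmup                  # linear warmup
    t = (epoch - warmup) / max(1, total - warmup)           # progress in [0, 1]
    return minimum + 0.5 * (peak - minimum) * (1.0 + math.cos(math.pi * t))
```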

B.2 Configuration for fine-tuning

Classification task on ImageNet-1K: For the classification task, a fully-connected layer is employed as the classifier after average pooling of the feature embeddings. The fine-tuning configurations on ImageNet-1K for different backbone architectures are listed in Tab. 9.
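As a rough sketch of this head, assuming ViT-Base's 768-d embeddings, a hypothetical 14×14 patch grid, and 1000 ImageNet classes:

```python
import numpy as np

# Hypothetical shapes: 14x14 = 196 patch embeddings of dimension 768
# (ViT-Base/16), classified into 1000 ImageNet categories.
rng = np.random.default_rng(0)
patch_embeddings = rng.standard_normal((196, 768))
W = rng.standard_normal((768, 1000)) * 0.01   # fully-connected classifier
b = np.zeros(1000)

pooled = patch_embeddings.mean(axis=0)        # average pooling over patches
logits = pooled @ W + b
```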

Configuration                     ViT-Base/16                 ViT-Large/16
Peak learning rate                {2e-3, 3e-3, 4e-3, 5e-3}    {2e-3, 3e-3, 4e-3, 5e-3}
Fine-tuning epochs                100                         50
Batch size                        1024                        1024
Warmup epochs                     20                          5
Layer-wise learning rate decay    0.65                        0.75
Adam ε                            1e-8                        1e-8
Adam β                            (0.9, 0.999)                (0.9, 0.999)
Minimal learning rate             1e-6                        1e-6
Learning rate schedule            Cosine                      Cosine
Repeated Aug                      None                        None
Weight decay                      0.05                        0.05
Label smoothing                   0.1                         0.1
Stoch. depth                      0.1                         0.1
Dropout                           None                        None
Gradient clipping                 None                        None
Erasing prob.                     0.25                        0.25
Input resolution
Rand Augment                      9/0.5                       9/0.5
Mixup prob.                       0.8                         0.8
Cutmix prob.                      1.0                         1.0
Color jitter                      0.4                         0.4
Table 9: Configurations for fine-tuning on ImageNet-1K.
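Layer-wise learning rate decay (0.65 for ViT-Base and 0.75 for ViT-Large in Tab. 9) geometrically shrinks the rate for shallower layers. A minimal sketch of the per-layer scales follows; the placement of the patch embedding below the first block and the head at the full rate is an assumption following common ViT fine-tuning code, not a detail stated in the paper.

```python
# Per-layer learning-rate scales for layer-wise decay (Tab. 9: 0.65 for
# ViT-Base, 0.75 for ViT-Large). Shallower layers get geometrically
# smaller rates; the head trains at the full peak rate.
def layer_scales(num_layers=12, decay=0.65):
    # index 0 = patch embedding, 1..num_layers = Transformer blocks,
    # index num_layers + 1 = classifier head
    return [decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]

scales = layer_scales()
```

Each scale multiplies the peak learning rate when building the optimizer's parameter groups, so early layers change slowly and the pre-trained low-level features are largely preserved.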

Semantic segmentation on ADE20K: For the semantic segmentation experiments on ADE20K [ade20k], we follow the implementation of BEiT [beit], adopting ViT-B [vit] as the default backbone and UperNet [upernet] as the task head. Tab. 10 summarizes the configurations for fine-tuning on ADE20K. Because the pre-training process does not involve instance discrimination, performance can be further improved by intermediate fine-tuning on ImageNet-1K, as observed in BEiT [beit]. We therefore also evaluate models after intermediate fine-tuning, i.e., pre-trained models that have first been fine-tuned on ImageNet-1K. For these models, the training recipe is the same as for the pre-training-only version.

Configuration                       ViT-Base/16
Peak learning rate                  1e-3
Fine-tuning steps                   160K
Batch size                          16
Adam ε                              1e-8
Adam β                              (0.9, 0.999)
Layer-wise learning rate decay      0.65
Minimal learning rate               0
Learning rate schedule              Linear
Warmup steps                        1500
Dropout                             None
Stoch. depth                        0.1
Weight decay                        0.05
Input resolution                    512×512
Position embedding                  Relative
Position embedding interpolation    Bilinear
Table 10: Configurations for fine-tuning on ADE20K.

Object detection and instance segmentation: We adopt the well-known COCO [coco] benchmark for the object detection and instance segmentation experiments. Because these experiments are not conducted in BEiT, we follow iBOT [ibot], and the BEiT [beit] results are based on our re-implementation. To adapt to the multi-scale strategy, we use absolute position embeddings and interpolate them for different image resolutions. ViT-B [vit] is adopted as the backbone and Cascade Mask R-CNN [maskrcnn, cascade] is used as the task head. Tab. 11 summarizes the configurations for fine-tuning on COCO. As with semantic segmentation, the training recipe for models with intermediate fine-tuning is the same as for the pre-training-only version.

Configuration                       ViT-Base/16
Peak learning rate                  1e-4
Fine-tuning epochs                  12
Weight decay                        0.05
Learning rate decay epochs          8, 11
Batch size                          16
Adam ε                              1e-8
Adam β                              (0.9, 0.999)
Layer-wise learning rate decay      0.75
Stoch. depth                        0.1
Multi-scale evaluation              None
Position embedding                  Absolute
Position embedding interpolation    Bilinear
Table 11: Configurations for fine-tuning on COCO.
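Bilinearly interpolating the absolute position embeddings for varying input resolutions (Tab. 11) can be sketched as follows. This is a sketch: the function name and the (H, W, D) grid layout are illustrative assumptions, not the paper's released code.

```python
import numpy as np

# Bilinearly resize an (H, W, D) absolute position-embedding grid to a new
# spatial size, as needed for COCO's multi-scale inputs (Tab. 11).
def interpolate_pos_embed(pos, new_h, new_w):
    h, w, _ = pos.shape
    ys = np.linspace(0.0, h - 1, new_h)          # fractional source rows
    xs = np.linspace(0.0, w - 1, new_w)          # fractional source cols
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]                # vertical blend weights
    wx = (xs - x0)[None, :, None]                # horizontal blend weights
    top = pos[y0][:, x0] * (1 - wx) + pos[y0][:, x1] * wx
    bot = pos[y1][:, x0] * (1 - wx) + pos[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

When the target size equals the source size the grid is returned unchanged, so the interpolation is a no-op at the pre-training resolution.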