Learning a general perception model that can handle various modalities and tasks is widely regarded as an important step towards artificial general intelligence. Due to its difficulty, many works (e.g., Florence [yuan2021florence], CoCa [yu2022coca], BEiT-3 [wang2022image]), also known as foundation models [bommasani2021opportunities], instead focus on the fallback solution of learning a general representation encoder that can be adapted (e.g., fine-tuned) to various downstream tasks. By performing large-scale pre-training on massive multi-modal task-agnostic data, these works have demonstrated their superiority by pushing the state of the art on a broad range of tasks, including single-modal tasks (e.g., image classification and object detection) as well as cross-modal tasks (e.g., image captioning and image retrieval).
Despite this success, there is still a considerable gap between foundation models and the goal of general perception modeling. Foundation models focus only on general representation learning, while task modeling is neglected: the traditional task-specific fine-tuning paradigm is still utilized (see Fig. 1). This significantly increases the marginal cost of adapting pre-trained models to various downstream tasks, making it difficult to meet the rapidly growing demands of diverse downstream tasks and scenarios. Such a task-specific fine-tuning paradigm is inconsistent with the goal of general perception modeling.
Instead of performing task-specific fine-tuning, generalist models process different tasks with shared architecture and parameters, which is aligned with the goal of general perception modeling. This not only reduces the cost of handling diverse tasks but also enables task collaboration. Most existing attempts at generalist models are sequence-to-sequence (seq2seq) models [pix2seqv2, ofa, unifiedio, alayrac2022flamingo, gpv1, gpv2, gato, unitab]. However, these attempts are inadequate in both versatility and performance: (1) some pillar vision and vision-language tasks listed in Tab. 1 cannot be handled, e.g., image-text retrieval, object detection, and instance segmentation; (2) their accuracy and inference speed still lag significantly behind state-of-the-art task-specific methods. Another line of research, Uni-Perceivers [zhu2022uni, zhu2022uni_p], builds generalist models supporting both generation and non-generation tasks. Nevertheless, they still cannot handle many vital tasks such as detection and segmentation.
To develop generalist models with better versatility and performance, our core idea is to encode images as general region proposals consisting of the semantic, bounding box and segmentation mask representations. Compared with previous methods where images are represented as non-overlapping patches, this design makes our localization modeling more expressive and flexible. This explicit utilization of localization clues not only greatly reduces the difficulty of handling localization tasks such as image detection and segmentation, but also provides richer features for non-localization tasks, thus enabling more general task modeling and better performance.
In this paper, we propose Uni-Perceiver v2 as a generalist model capable of handling major large-scale vision and vision-language tasks as listed in Tab. 1. Specifically, images are encoded as a concatenation of global and regional representations via a region proposal network, while texts are encoded via a Transformer-based language model. Both the image and text encoders can benefit from off-the-shelf pre-trained models, which reduces the demand for training data and resources and ensures performance. The encoded representations are transformed by a shared modality-agnostic Transformer [vaswani2017attention] network to obtain the decoded representations. Following Uni-Perceivers [zhu2022uni_p, zhu2022uni], different tasks are formulated as a unified maximum likelihood estimation problem and are jointly learned to enable general task adaptation. We further propose an improved optimizer named MT-AdamW to ensure stable multi-task learning with an unmixed sampling strategy which only samples one task for all GPUs per iteration. This is very helpful for tasks requiring large batch size training.
Uni-Perceiver v2 is the first generalist model achieving competitive results on major large-scale vision and vision-language tasks listed in Tab. 1, including object detection, instance segmentation, image classification, image captioning, and image-text retrieval, except for image generation, which has not been verified due to limited computational resources. After being jointly trained on various tasks, it can directly handle a broad range of tasks without any task-specific adaptation, achieving state-of-the-art performance among existing generalist models. Our contributions are summarized as:
We propose Uni-Perceiver v2, which is the first generalist model capable of handling both localization and non-localization tasks with competitive performance. The general region proposal encoding of images brings more flexible and expressive localization modeling.
To improve the effectiveness of multi-task learning, we adopt an unmixed sampling strategy to enable large batch-size training and develop an improved optimizer named MT-AdamW to mitigate the instability in gradients.
Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Without any task-specific adaptation, Uni-Perceiver v2 achieves competitive performance on a broad range of downstream tasks compared with commonly-recognized strong baselines that require task-specific fine-tuning, demonstrating its strong ability of general task modeling.
[Tab. 1, excerpt: task categories and example tasks]
| — | Key point detection; Referring expression grounding; Human object interaction; Optical character recognition |
| Mask Prediction | Instance segmentation |
| Image Generation | Image synthesis; Segment-based image generation; Surface normal estimation; Image super resolution |
| Image to Text | Image captioning; Visual question answering; Visual commonsense reasoning |
2 Related Work
Foundation Vision Models are “designed to be adapted (e.g., fine-tuned) to various downstream tasks by pre-training on broad data at scale” [bommasani2021opportunities]. Such large-scale pre-trained vision models have shown effectiveness in enriching data encoding capacity, alleviating data hunger, and improving the performance of downstream tasks.
Image classification on ImageNet-1k [deng2009imagenet] has been the mainstream pre-training paradigm for a long period. However, as the model size grows, larger annotated datasets are required to avoid over-fitting in pre-training, such as ImageNet-21k [deng2009imagenet], Instagram-1B [mahajan2018exploring], JFT-300M [sun2017revisiting], and JFT-3B [zhai2022scaling]. Inspired by the success of linguistic pre-training on massive web-crawled text, CLIP [clip] and ALIGN [align] have begun to focus on multi-modal contrastive pre-training on web-scale noisy image-text pairs to learn aligned image and text representations. SimVLM [wang2021simvlm] employs the multi-modal sequence generation task for pre-training. FLAVA [singh2021flava] combines contrastive and generative pre-training to handle both unimodal and multimodal tasks. UniCL [yang2022unified] and CoCa [yu2022coca] jointly use human-annotated and web-crawled data. Florence [yuan2021florence] and INTERN [shao2021intern] increase the scale and diversity of pre-training data to enhance the representation capability. OmniVL [wang2022omnivl] proposes to incorporate both image-language and video-language tasks in its pre-training. BEiT-3 [wang2022image] unifies pre-training objectives for different modalities as a single masked data modeling task, achieving state-of-the-art results on a wide range of downstream tasks.
These works on foundation models only focus on general representation learning, while neglecting task modeling. When adapting them to downstream tasks, the traditional task-specific fine-tuning paradigm is still utilized, which is inconsistent with the goal of general perception modeling. Meanwhile, with the rapidly growing demands of diverse tasks and scenarios, the task-specific fine-tuning paradigm would result in a prohibitive marginal cost for data collection, data annotation, model training, and model storage.
Generalist Models handle various tasks with shared architecture and parameters, a goal long pursued by the machine learning community. Recently, inspired by the success of sequence-to-sequence (seq2seq) models in the NLP field [radford2018improving], OFA [ofa], Flamingo [alayrac2022flamingo], and GIT [wang2022git] propose to model various tasks as sequence generation. Unified-IO [unifiedio], Pix2Seq v2 [pix2seqv2], and UniTab [unitab] further develop this method to support more tasks by introducing discrete coordinate tokens, so that location information can be encoded and decoded by the unified models. Beyond that, Gato [gato] succeeds in unifying reinforcement learning tasks into the seq2seq framework. GPV [gpv1] also builds a general-purpose vision system by adding a seq2seq module on top of a DETR [carion2020end]-based visual encoder.
However, these seq2seq methods are still inadequate in both versatility and performance: (1) they cannot handle some core vision tasks, e.g., image-text retrieval, object detection, and instance segmentation; although Pix2Seq v2 [pix2seqv2] includes detection and instance segmentation, its performance and inference speed still lag significantly behind state-of-the-art task-specific methods [zhang2022dino, li2022mask]; (2) the non-parallel auto-regressive decoding leads to slow inference; for example, image classification requires calculating and comparing the cumulative probabilities of all category names conditioned on the given image; (3) they also suffer from the task-interference issue in multi-task learning, resulting in performance degradation compared with task-specific models.
Alternatively, Uni-Perceivers [zhu2022uni_p, zhu2022uni] formulate different tasks as finding the maximum likelihood target for each input through the representation similarity regardless of their modality, making it possible to support both generation and non-generation tasks. Nevertheless, they still cannot handle image detection and segmentation tasks.
3 Revisiting Uni-Perceivers
Unified Modeling of Perception Tasks. Uni-Perceiver [zhu2022uni_p] proposes to reformulate different tasks as a unified maximum likelihood estimation problem. Specifically, each task is defined with a set of inputs and a set of candidate targets from arbitrary combinations of modalities. The inputs and targets are first encoded with a modality-specific tokenizer followed by a linear projection. The encoded representations are then transformed by a modality-agnostic decoder with parameters shared across tasks. Given an input, the unified task objective is defined as finding the target with the maximum likelihood given the input.
Mitigating Task Interference. Multi-task learning with fully shared parameters could introduce interference between different tasks. Uni-Perceiver-MoE [zhu2022uni] proposes Conditional MoEs to address the task-interference issue. Specifically, for each input token, a routing decision is calculated depending on a specific routing strategy, which sparsely activates a small portion of experts to process this token. The corresponding output of an input token is the linearly weighted combination of the selected experts according to the routing decision. Conditional MoEs mitigate the interference issue by allowing conflicting modalities and tasks to use separate parameters, without introducing any task-specific modules.
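The routing described above can be sketched in a few lines. The following is a minimal NumPy illustration (our own toy version with linear experts and a softmax gate, not the paper's implementation), where each token's output is the routing-weighted combination of its top-k experts:

```python
import numpy as np

def conditional_moe(x, experts, gate_w, k=2):
    """Toy Conditional-MoE layer: each token is routed to the top-k
    experts chosen by a gate, and the output is the weighted
    combination of those experts' outputs.

    x:        (n_tokens, d) input tokens
    experts:  list of (d, d) weight matrices (toy linear experts)
    gate_w:   (d, n_experts) gating weights
    """
    logits = x @ gate_w                           # (n_tokens, n_experts)
    # softmax over experts to obtain routing probabilities
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        topk = np.argsort(probs[t])[-k:]          # sparsely activated experts
        w = probs[t, topk] / probs[t, topk].sum() # renormalized routing weights
        for j, wj in zip(topk, w):
            out[t] += wj * (x[t] @ experts[j])
    return out
```

Only k of the experts run per token, which is what keeps the computation sparse even as the number of experts grows.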
Limitations. Although Uni-Perceivers aim to process different tasks with a unified architecture, they fail to handle detection and segmentation tasks due to the lack of localization information in their encoded features. Meanwhile, Uni-Perceivers do not integrate off-the-shelf encoder models and thus cannot benefit from existing large-scale pre-trained encoders. This potentially increases their demand for pre-training data and resources, limiting their performance.
4.1 Encoding Images as General Region Proposals
Most existing generalist models [zhu2022uni_p, zhu2022uni] represent images as non-overlapping patches with fixed sizes. This design is rather coarse and limited in modeling objects of varying sizes and shapes in images, making it difficult to handle localization tasks such as detection and segmentation.
In order to enable more expressive and flexible localization modeling, we propose to encode the input image as a sequence of general region proposals. Specifically, given an input image $I$ with height $H$ and width $W$, a network is employed to encode the image as the concatenation of global and regional representations as
$$f = \mathrm{Concat}\big(f^{\mathrm{global}},\ f^{\mathrm{regional}}\big),$$
where $f^{\mathrm{global}}$ are the global representations of the whole image, and $f^{\mathrm{regional}}$ are the regional representations of candidate object proposals in the image.
Following the common practice in localization tasks, an image backbone network (e.g., ResNet [he2016deep]) is firstly employed to extract the multi-scale feature maps $\{C_l\}_{l=1}^{L}$, where $L$ is the number of feature scales (e.g., $L=4$).
Regional Representations. A Transformer [vaswani2017attention]-based region proposal network is applied on top of the multi-scale feature maps to extract a set of candidate object proposals $\{(z_i, b_i, m_i)\}_{i=1}^{N}$, where $z_i$, $b_i$, and $m_i$ are the semantic, bounding box, and segmentation mask representations of the $i$-th proposal, respectively. The region proposal network is similar to MaskDINO [li2022mask], but only considers foreground-background binary classification. See Appendix for detailed implementation. These three representations are then fused into the regional representation as
$$f^{\mathrm{regional}}_i = z_i + \mathrm{PE}(b_i) + \mathrm{Pool}(m_i),$$
where $\mathrm{PE}(\cdot)$ denotes the positional encoding of box coordinates, and $\mathrm{Pool}(\cdot)$ uses adaptive average pooling to scale the mask predictions to a fixed spatial size. Both $\mathrm{PE}(\cdot)$ and $\mathrm{Pool}(\cdot)$ are followed by linear projections to match the feature dimension.
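As an illustration of fusing a proposal's semantic, box, and mask representations, here is a minimal NumPy sketch. The sine/cosine box encoding, the pooling size, and the projection names (`w_box`, `w_mask`) are our assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def fuse_regional(z, box, mask, w_box, w_mask, pool_size=4):
    """Toy fusion of one proposal's representations: the fused feature
    is the sum of the semantic feature, a projected positional encoding
    of the box, and a projected pooled mask.

    z:      (d,) semantic representation
    box:    (4,) normalized box coordinates
    mask:   (H, W) mask logits; H and W assumed divisible by pool_size
    w_box:  (4 * n_freq * 2, d) linear projection for the box encoding
    w_mask: (pool_size * pool_size, d) linear projection for the mask
    """
    # sine/cosine positional encoding of the 4 box coordinates
    n_freq = w_box.shape[0] // 8
    freqs = 2.0 ** np.arange(n_freq)
    ang = box[:, None] * freqs[None, :]                     # (4, n_freq)
    pe = np.concatenate([np.sin(ang), np.cos(ang)], -1).reshape(-1)
    # adaptive average pooling of the mask to pool_size x pool_size
    H, W = mask.shape
    pooled = mask.reshape(pool_size, H // pool_size,
                          pool_size, W // pool_size).mean((1, 3)).reshape(-1)
    return z + pe @ w_box + pooled @ w_mask
```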
Global Representations. The global representations are extracted from the last-scale feature map $C_L$. Several instances of parameterized Attention Pooling [clip] are employed to extract global features. The pooled features are concatenated with the flattened feature map to obtain the global representations as
$$f^{\mathrm{global}} = \mathrm{Concat}\big(\mathrm{AttnPool}(C_L),\ \mathrm{Flatten}(C_L)\big).$$
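A bare-bones version of such query-based pooling (our simplification of CLIP-style attention pooling, with all learned projections omitted) might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis, keepdims=True))
    return e / e.sum(axis, keepdims=True)

def attention_pool(feat, queries):
    """Each learnable query attends over the flattened feature map and
    returns one pooled vector.

    feat:    (h * w, d) flattened last-scale feature map
    queries: (m, d) learnable query vectors
    """
    attn = softmax(queries @ feat.T / np.sqrt(feat.shape[1]))  # (m, h*w)
    return attn @ feat                                         # (m, d)
```

The pooled vectors would then be concatenated with the flattened feature map itself to form the global representations.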
4.2 Encoding Text with Language Models
A Transformer [vaswani2017attention]-based language model is used to encode textual data, such as category names in classification tasks, image descriptions in image-text retrieval tasks, and the vocabulary in image captioning tasks. Specifically, a BPE tokenizer [sennrich2015neural] tokenizes the input text into a sequence of word embeddings, and a Transformer encoder is employed to extract the text feature sequence as
$$f^{\mathrm{text}} = \{t_j\}_{j=1}^{n},$$
where $t_j$ is the encoded feature of the $j$-th word, and $n$ is the sequence length. In our implementation, we use a pre-trained RoBERTa [liu2019roberta] as the text encoder, which is jointly tuned with the whole network.
4.3 General Task Adaptation
We follow Uni-Perceivers [zhu2022uni_p, zhu2022uni] to formulate different tasks as a unified maximum likelihood estimation problem. Given an input $x$ and the candidate target set $\mathcal{Y}$, the task objective is defined as finding the target with the maximum likelihood as
$$\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y \mid x),$$
where the likelihood $P(y \mid x)$ is estimated from the cosine similarity between the representations of $x$ and $y$ as
$$P(y \mid x) \propto \exp\Big(\cos\big(d(e(x)),\ d(e(y))\big) / \tau\Big),$$
where $e(\cdot)$ denotes the modality-specific encoders introduced in Sec. 4.1 and 4.2, $d(\cdot)$ is a modality-agnostic Transformer [vaswani2017attention] network shared across different tasks, and $\tau$ is a learnable temperature parameter.
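Numerically, the objective amounts to a softmax over temperature-scaled cosine similarities. The toy helper below (ours, not the paper's code) scores the candidate targets of one input; the prediction is simply the argmax:

```python
import numpy as np

def likelihoods(input_repr, target_reprs, tau=0.07):
    """Likelihood of each candidate target as a softmax over
    temperature-scaled cosine similarities between the decoded input
    and target representations.

    input_repr:   (d,) decoded representation of the input
    target_reprs: (n_targets, d) decoded representations of candidates
    """
    a = input_repr / np.linalg.norm(input_repr)
    b = target_reprs / np.linalg.norm(target_reprs, axis=1, keepdims=True)
    sim = b @ a                                    # cosine similarities
    scaled = sim / tau
    e = np.exp(scaled - scaled.max())              # stable softmax
    return e / e.sum()
```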
Depending on task requirements, the modality-specific encoded representation for inputs can be an image feature sequence $f^{\mathrm{img}}$, a text feature sequence $f^{\mathrm{text}}$, or their concatenation, with an additional <SPE> token inserted at the beginning. The encoded representation for targets is constructed in the same way.
To obtain general task modeling capability, Uni-Perceiver v2 conducts multi-task learning on various uni-modal and multi-modal tasks. Denote the set of tasks as $\{(\mathcal{X}_k, \mathcal{Y}_k)\}_{k=1}^{K}$, where $\mathcal{X}_k$ and $\mathcal{Y}_k$ are the input set and target set of the $k$-th task, respectively. The training loss is
$$\mathcal{L} = \sum_{k=1}^{K} s_k \, w_k \, \mathbb{E}_{(x,\, y) \sim (\mathcal{X}_k,\, \mathcal{Y}_k)} \big[ -\log P(y \mid x) \big],$$
where $s_k$ and $w_k$ denote the sampling ratio and loss weight of the $k$-th task, respectively. The sampling ratios are normalized as $\sum_{k=1}^{K} s_k = 1$. We refer to Sec. 4.4 for a detailed discussion of the sampling strategy. To mitigate task interference in multi-task training, we follow Uni-Perceiver-MoE [zhu2022uni] to employ Conditional MoEs with the attribute-level routing strategy.
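The weighted joint objective can be illustrated with a small helper (a sketch under the notation above; the dict layout and task names are ours):

```python
def multitask_loss(task_losses, sampling_ratios, loss_weights):
    """Joint objective: each task's loss is weighted by its normalized
    sampling ratio and its loss weight.

    task_losses, sampling_ratios, loss_weights: dicts keyed by task name.
    """
    total_ratio = sum(sampling_ratios.values())
    total = 0.0
    for name, loss in task_losses.items():
        s = sampling_ratios[name] / total_ratio  # normalize so ratios sum to 1
        total += s * loss_weights[name] * loss
    return total
```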
Tasks with Localization. Uni-Perceiver v2 can perform localization tasks such as object detection and instance segmentation by decoding the regional representations. Specifically, for each region proposal, its output feature from the unified decoder is compared with class embeddings to obtain the class prediction as in Eq. (5). The corresponding bounding box and segmentation mask serve as the localization predictions.
Tasks without Localization. Uni-Perceiver v2 can also handle tasks that do not need localization predictions, e.g., image classification, image captioning, and image-text retrieval. It follows a similar formulation to Uni-Perceiver for these tasks, with two major differences: (1) more expressive and flexible localization clues for images, which better facilitate these tasks; (2) both the image and text encoders can leverage off-the-shelf modality-specific pre-trained models, leading to better performance.
4.4 Sampling Strategy and Improved Optimization
Optimizing generalist models follows the paradigm of multi-task learning, which performs joint training on data from different tasks. Current methods usually mix all tasks in one training iteration [zhu2022uni_p, ofa, unifiedio]. Such a mixed sampling strategy limits the batch size of each task, which can be detrimental for tasks that benefit from large batch-size training (e.g., image-text retrieval).
A straightforward solution is to sample only one task per iteration, which we refer to as the unmixed sampling strategy. It achieves the largest possible per-task batch size. However, when different iterations sample different tasks, the gradients vary greatly due to the differences in data and tasks, which may bring potential instability to multi-task learning and deteriorate performance.
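A simple way to implement unmixed sampling in a data-parallel setup is to derive the task choice only from the iteration index (never from the GPU rank), so every worker independently arrives at the same task and the whole global batch belongs to it. The sketch below is ours, not the paper's code:

```python
import random

def sample_task(iteration, tasks, ratios, seed=1234):
    """Unmixed sampling: draw one task for this iteration, weighted by
    the task sampling ratios. Seeding the draw with (seed, iteration)
    only means every GPU picks the same task without communication.
    """
    rng = random.Random(seed * 1_000_003 + iteration)
    return rng.choices(tasks, weights=ratios, k=1)[0]
```

Because the seed is a pure function of the iteration, no broadcast is needed to keep workers in sync.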
To mitigate the instability issue of unmixed sampling strategy, we propose an improved optimizer for multi-task training, named as MT-AdamW. The core idea is to balance the gradient of each task, by normalizing the gradient of each iteration and compensating it according to the task sampling ratio.
Suppose the $k$-th task is sampled at timestep $t$. The vanilla AdamW [loshchilov2017decoupled] is modified to MT-AdamW by updating the parameters as follows: let $\mathcal{L}_k$ be the loss function of the sampled $k$-th task at timestep $t$ with gradient $g_t$, and let $\eta$ be the learning rate (the weight decay and bias corrections are omitted for simplicity). The original task gradients are first normalized to stabilize training, with the scaling factor $w_k$ serving as the loss weight of the sampled task, yielding the trimmed gradient
$$\hat{g}_t = w_k \, \frac{g_t}{\lVert g_t \rVert},$$
which is used to estimate the first moment $m_t$ and second moment $v_t$ of the gradients in a moving-average way. To further decouple the gradient contribution and sampling ratio of each task, a task-specific compensation coefficient $1 / s_k$ is used to unbias the estimations $m_t$ and $v_t$. In practice, if all tasks are expected to contribute equally, all scaling factors can be set to the same value.
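The update can be sketched as follows. This is our reading of the MT-AdamW idea, with bias correction and weight decay omitted as in the text; it is a sketch, not the authors' code:

```python
import numpy as np

def mt_adamw_step(p, grad, state, w_k, s_k, lr=1e-4,
                  betas=(0.9, 0.999), eps=1e-8):
    """One MT-AdamW step for parameter p under the sampled k-th task.

    The task gradient is normalized and rescaled by the task loss
    weight w_k; the moment estimates are compensated by 1 / s_k (the
    sampling ratio) to decouple a task's gradient contribution from
    how often it is sampled.
    """
    b1, b2 = betas
    g = w_k * grad / (np.linalg.norm(grad) + eps)     # trimmed gradient
    m = state.get("m", np.zeros_like(p))
    v = state.get("v", np.zeros_like(p))
    m = b1 * m + (1 - b1) * g / s_k                   # compensated 1st moment
    v = b2 * v + (1 - b2) * (g / s_k) ** 2            # compensated 2nd moment
    state["m"], state["v"] = m, v
    return p - lr * m / (np.sqrt(v) + eps)
```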
Uni-Perceiver v2 performs multi-task training on various tasks and publicly available datasets to achieve general task modeling capability. It uses similar datasets as Uni-Perceiver [zhu2022uni_p]. Specifically, the image classification task is trained on the ImageNet-1k [deng2009imagenet] dataset. For object detection and instance segmentation, COCO [lin2014microsoft] is used for training. For image captioning and image-text retrieval, we use a combination of image-text-pair datasets: SBU Captions [ordonez2011im2text], Visual Genome [krishna2017visual], COCO Caption [Chen2015MicrosoftCC], CC3M [sharma2018conceptual], CC12M [changpinyo2021cc12m], and YFCC [yfcc]. We also add a language modeling task during training, trained on BookCorpus [zhu2015aligning] and English Wikipedia (Books&Wiki).
During evaluation, we evaluate generalist models on the most representative datasets for the pillar vision and vision-language tasks listed in Tab. 1. Specifically, ImageNet-1k [deng2009imagenet] and COCO Caption [Chen2015MicrosoftCC] are utilized to evaluate image classification and image captioning, respectively. For image-text retrieval, COCO Caption and Flickr30k [plummer2015flickr30k] are utilized. Note that Flickr30k is not involved in training. For object detection and instance segmentation, COCO [lin2014microsoft] is used. We put the licenses of all datasets in the Appendix.
5.2 Implementation Details
We implement three Uni-Perceiver v2 variants with different backbones, i.e., ResNet-50 [he2016deep], Swin-Base [liu2021swin], and Swin-Large. ResNet-50 is pre-trained on ImageNet-1k, and Swin-Base is pre-trained on ImageNet-21k. Swin-Large is firstly pre-trained on ImageNet-21k and then trained on the detection task with Objects365 [shao2019objects365]. The number of feature scales is set to 4 for all models. A Transformer [vaswani2017attention]-based region proposal network is used to generate general region proposals, whose architecture and settings mainly follow Mask DINO [li2022mask]. However, we replace all multi-category classifiers with binary classifiers. In addition, multiple parameterized Attention Pooling operations are used to extract global features. We choose the pre-trained RoBERTa [liu2019roberta] as the text encoder, which is jointly tuned with the whole network. The unified decoder is also a Transformer-based network, whose parameters are randomly initialized and optimized from scratch. Its architecture follows the setting of the BERT [devlin2018bert] model, but it only consists of 6 Transformer layers. To mitigate the task-interference issue in multi-task learning, we also employ the attribute-level Conditional MoEs [zhu2022uni] in all FFN layers of the unified decoder. Please refer to the Appendix for more details.
Unless otherwise stated, we adopt the unmixed sampling strategy, which samples only one task for all GPUs per iteration. The MT-AdamW optimizer with a base learning rate of 0.0001 and a weight decay of 0.0001 is utilized. The learning rate of the modality-specific encoders is multiplied by 0.1 since they have already been pre-trained. Uni-Perceiver v2 with Swin-Base and Swin-Large backbones is trained for 200,000 iterations on 32 and 64 NVIDIA A100 GPUs, respectively. The learning rate is decayed at the 160,000-th iteration. For models with ResNet-50, we only train on 16 NVIDIA A100 GPUs for 150,000 iterations. For other training settings, please also refer to the Appendix.
5.3 Ablation Studies
In the following, we evaluate the key components of Uni-Perceiver v2 with the ResNet-50 backbone by evaluating its performance on four tasks, i.e., object detection on COCO, image classification on ImageNet-1k, image-text retrieval on COCO Caption, and image captioning on COCO Caption. The instance segmentation and language modeling tasks are not included to save training costs, and the YFCC dataset is also excluded from training. Note that the performance on these datasets is reported without any task-specific fine-tuning. Unless otherwise stated, a COCO-detection pre-trained ResNet-50 is used for ablation studies to accelerate the convergence of multi-task training.
Effectiveness of Global and Regional Image Representations. Uni-Perceiver v2 encodes images as the concatenation of global and regional representations. To evaluate their effectiveness on different tasks, we conduct experiments that employ different representations, i.e., using only global representations, using only regional representations, and using both. Results in Tab. 2 show that: (1) the regional representation is crucial for both captioning and retrieval tasks. We speculate that this is because region proposals provide localization clues that are helpful to both tasks. (2) Compared with regional-only representations, global representations deliver better results on the image classification task, which indicates that global representations are important for image-level tasks. (3) Combining global and regional representations allows the two to complement each other, achieving the best overall results on all tasks. Therefore, combining global and regional representations is taken as the default setting in our subsequent experiments.
[Tab. 2, excerpt]
| Representations | COCO det. | IN-1k cls. | Retrieval R@1 | Retrieval R@1 | Caption |
| Global + Regional | 49.9 | 76.9 | 51.3 | 38.8 | 30.6 |
Task Collaboration and Interference. To analyze the collaboration and interference between different tasks, we conduct experiments by removing each task independently from the joint-training tasks in Tab. 3. If the removal of one task can improve (or degrade) the performance of another task, it can reflect that the former task is detrimental (or beneficial) to the latter one during joint training. For a fair comparison, the Conditional MoEs are not employed except for the last experiment. Results show that without MoEs, other tasks have negative impacts on the training of image-text retrieval. However, the image-text retrieval task could promote the performance of image captioning. The image classification task is also very helpful to image captioning, yet the reverse has no obvious effect. It should be noted that all models employ an image encoder pre-trained on COCO detection, thereby all these tasks can benefit from the pre-trained region proposal network. The results indicate that task interference indeed exists in the multi-task training of generalist models and is more common than task collaboration, suggesting the importance of addressing the task interference issue. By employing Conditional MoEs, the task interference is largely mitigated, resulting in improved results on all tasks.
[Tab. 3, excerpt]
| Setting | COCO det. | IN-1k cls. | Retrieval R@1 | Retrieval R@1 | Caption |
| w/o Detection | - | 76.6 (0.3) | 47.0 (1.0) | 34.6 (0.1) | 30.4 (0.5) |
| w/o Classification | 50.1 (0.3) | - | 51.6 (5.6) | 38.6 (3.9) | 25.9 (3.0) |
| w/o Retrieval | 49.5 (0.3) | 76.3 (0.0) | - | - | 27.4 (1.5) |
| w/o Captioning | 49.7 (0.1) | 76.3 (0.0) | 51.2 (5.2) | 38.3 (3.6) | - |
| All Tasks w/ MoE | 49.9 (0.1) | 76.9 (0.6) | 51.3 (5.3) | 38.8 (4.1) | 30.6 (0.7) |
Sampling Strategy and Improved Optimization. We evaluate the effectiveness of the unmixed sampling strategy (i.e., sampling one task per iteration) and the proposed MT-AdamW optimizer in Tab. 4. From the results, we observe that the vanilla unmixed sampling strategy, which computes the contrastive loss only with the samples on each GPU, has a slightly adverse effect on the learning of all tasks compared with the mixed sampling strategy. When the batch size is increased by gathering features across all GPUs, the performance of retrieval tasks is largely improved. Further introducing the MT-AdamW optimizer leads to more stable multi-task training and consistently improved performance across all tasks.
[Tab. 5, excerpt]
| Pre-training | Data | COCO det. | IN-1k cls. | Retrieval R@1 | Retrieval R@1 | Caption |
| Supervised | IN-1k & COCO | 49.9 | 76.9 | 51.3 | 38.8 | 30.6 |
[Tab. 6, excerpt]
| Model | #Params | IN-1k acc. | COCO det. | COCO seg. | Caption B@4 | Caption CIDEr | R@1 | R@1 | R@1 | R@1 |
| Pix2Seq v2 [pix2seqv2] | 132M | - | 46.5 | 38.2 | 34.9 | - | - | - | - | - |
| Unified-IO LARGE [unifiedio] | 776M | 71.8 | - | - | - | - | - | - | - | - |
| Unified-IO XL [unifiedio] | 2.9B | 79.1 | - | - | - | 122.3 | - | - | - | - |
| Uni-Perceiver BASE [zhu2022uni_p] | 124M | 79.2 | - | - | 32.0 | - | 64.9 | 82.3 | 50.7 | 71.1 |
| Uni-Perceiver LARGE [zhu2022uni_p] | 354M | 82.7 | - | - | 35.3 | - | 67.8 | 83.7 | 54.1 | 74.2 |
| Uni-Perceiver-MoE BASE [zhu2022uni] | 167M | 80.3 | - | - | 33.2 | - | 64.6 | 82.1 | 51.6 | 72.4 |
| Uni-Perceiver-MoE LARGE [zhu2022uni] | 505M | 83.4 | - | - | 35.5 | - | 67.9 | 83.6 | 55.3 | 75.9 |
Effects of Different Image Encoder Pre-training. By integrating off-the-shelf encoder models, Uni-Perceiver v2 is capable of leveraging existing large-scale pre-trained encoders. To analyze the effects of different pre-training, we employ different pre-trained models for the image encoder. For supervised pre-training, we employ ResNet-50 pre-trained on ImageNet-1k, on ImageNet-21k, or consecutively pre-trained on ImageNet-1k and COCO. For weakly-supervised or unsupervised pre-training, we employ ResNet-50 pre-trained with MoCo v2 [chen2020mocov2] or CLIP [clip]. Tab. 5 demonstrates that different pre-training data and methods benefit different downstream tasks. Specifically, supervised pre-training shows the most obvious benefits on downstream tasks similar to it, e.g., ImageNet-21k pre-training delivers the best results on ImageNet-1k classification. Besides, pre-training on large-scale supervised (ImageNet-21k), weakly-supervised, or unsupervised data (CLIP and MoCo v2) is more helpful to vision-language tasks such as image-text retrieval and image captioning, possibly thanks to more general representations.
5.4 Main Results
To further verify the effectiveness of Uni-Perceiver v2, we incorporate more powerful backbones including Swin-Base and Swin-Large, denoted as Uni-Perceiver-v2 BASE and Uni-Perceiver-v2 LARGE, respectively. In addition to the tasks included in the ablation studies, we also incorporate instance segmentation on COCO, language modeling on Books&Wiki, and image captioning / image-text retrieval on YFCC for larger-scale multi-task training.
Comparison with Existing Generalist Models. We list the performance of Uni-Perceiver v2 and other generalist models on pillar vision and vision-language tasks in Tab. 6. Since generalist models aim to process different tasks with shared architecture and parameters, task-specific fine-tuning would lose the general modeling ability; we therefore report the performance of the shared models without any task-specific adaptation. Specifically, Uni-Perceiver-v2 BASE outperforms all previous generalist models on all tasks except Flickr30k retrieval, even though some methods have far more model parameters, e.g., Unified-IO XL and Flamingo-3B. The performance disadvantage on Flickr30k may be due to the use of private data by Flamingo-3B. Further scaling up to the Swin-Large backbone, Uni-Perceiver-v2 LARGE obtains the best performance on all tasks. Thanks to the flexibility of general region proposals, Uni-Perceiver v2 supports the most pillar tasks among generalist models and achieves competitive results consistently, which indicates its superiority in both versatility and performance.
Comparison with Specialized Models. We compare Uni-Perceiver v2 with commonly-recognized strong baseline models and previous SoTA generalist models on the pillar tasks in Tab. 2. The results show that Uni-Perceiver v2 significantly decreases the performance gap between generalist models and commonly-recognized strong baselines that require task-specific fine-tuning. It achieves comparable results across all tasks except retrieval on Flickr30k, which we suspect is because ALIGN [align] uses 1.8B private image-text pairs, much larger than our training data. In contrast, Uni-Perceiver v2 uses only public data for training.
We propose Uni-Perceiver v2, which is the first generalist model that achieves competitive results on major large-scale vision and vision-language tasks. After being jointly trained on single-modal and multi-modal tasks, Uni-Perceiver v2 achieves competitive performance on a broad range of downstream tasks. As for limitations, our method has not been verified on image generation tasks due to limited computational resources.
Appendix A Architecture Details of the Image Encoder
As shown in Fig. 3, our Uni-Perceiver v2 consists of three main parts: the image encoder, the text encoder, and the unified decoder. In this section, we describe the architecture details of the image encoder.
Backbone Network. Given an input image with height $H$ and width $W$, a backbone network (e.g., ResNet [he2016deep], Swin-Transformer [liu2021swin]) is firstly employed to extract the multi-scale feature maps $\{C_l\}_{l=1}^{L}$, where $L$ is the number of feature scales, and the spatial shapes of the feature maps are $\frac{H}{4}\times\frac{W}{4}$, $\frac{H}{8}\times\frac{W}{8}$, $\frac{H}{16}\times\frac{W}{16}$, and $\frac{H}{32}\times\frac{W}{32}$. The feature maps are transformed by convolutions to match the hidden dimension $d$ of the following Transformer-based region proposal network. An additional stride-2 convolution layer is applied on the last transformed feature map to extract a smaller feature map of spatial shape $\frac{H}{64}\times\frac{W}{64}$.
Region Proposal Network. A Transformer-based region proposal network is applied on top of the multi-scale feature maps to generate regional representations. Specifically, in the 4-scale setting adopted by Uni-Perceiver v2, the input of the Transformer encoder is the backbone feature maps except the first scale $x'_1$. A deformable Transformer encoder [zhu2020deformable] is employed to extract multi-scale encoded features whose spatial shapes and dimensions are the same as the corresponding input features. To generate the region proposals, we apply a deformable Transformer decoder on the multi-scale encoded features. To construct the $N$ input object queries of the Transformer decoder, we predict the objectness and bounding boxes of each feature pixel in the encoded feature maps, and select the top-$N$ features based on their objectness. The selected features are added to randomly initialized object queries as the input of the Transformer decoder, and their locations serve as the initial guess of the bounding boxes of the region proposals.
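The two-stage query selection above can be sketched in a few lines. This is a minimal NumPy illustration with random stand-ins for the predicted objectness and boxes; all names and sizes are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
num_pixels, dim, num_queries = 1000, 256, 300  # illustrative sizes

encoded = rng.standard_normal((num_pixels, dim))  # flattened encoded features
objectness = rng.standard_normal(num_pixels)      # predicted per-pixel objectness
boxes = rng.random((num_pixels, 4))               # predicted box per pixel

# Select the top-N pixels by objectness; their features seed the object
# queries, and their boxes serve as the initial box guesses.
topk = np.argsort(-objectness)[:num_queries]
content_queries = rng.standard_normal((num_queries, dim))  # stands in for learned queries
queries = content_queries + encoded[topk]
init_boxes = boxes[topk]
```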
The Transformer decoder generates a set of candidate object proposals $\{(f_i, b_i, m_i)\}_{i=1}^{N}$, where $f_i$, $b_i$, and $m_i$ are the semantic, bounding box, and segmentation mask representations of the $i$-th proposal, respectively. Following Mask2Former [mask2former] and MaskDINO [li2022mask], the segmentation mask representations are obtained by the dot product of the final-layer hidden state of the $i$-th proposal and a per-pixel feature map,
$$m_i = h_i \cdot \mathcal{F}, \qquad \mathcal{F} = g_3\big(g_1(x'_1) + g_2(E)\big),$$
where $h_i$ is the final-layer hidden state of the $i$-th proposal, $E$ is the finest-scale encoded feature map, $g_1$ is a convolution layer followed by a Group Normalization (GN) [wu2018group], $g_2$ is a convolution followed by a GN and a bilinear upsampling, and $g_3$ is a convolution followed by a GN, a ReLU, and a convolution.
The regional representations are obtained by fusing the semantic, bounding box, and segmentation mask representations,
$$r_i = f_i + \mathrm{PE}(b_i) + \mathrm{Pool}(m_i),$$
where $\mathrm{PE}(\cdot)$ denotes the positional encoding of box coordinates, and $\mathrm{Pool}(\cdot)$ uses adaptive average pooling to scale the mask predictions to a fixed spatial size. Both $\mathrm{PE}(\cdot)$ and $\mathrm{Pool}(\cdot)$ are followed by linear projections to match the feature dimension. Note that the bounding box and segmentation mask representations are detached before fusing.
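The mask dot product and the fusion step above can be sketched with NumPy. This is an illustrative toy: the sizes are arbitrary, and the two weight matrices stand in for the positional-encoding/pooling branches followed by their linear projections:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, Hm, Wm = 5, 32, 16, 16                   # illustrative sizes

hidden = rng.standard_normal((N, C))           # final-layer hidden states h_i
pixel_feat = rng.standard_normal((C, Hm, Wm))  # per-pixel feature map F

# Segmentation masks via dot product between query states and per-pixel features.
masks = np.einsum('nc,chw->nhw', hidden, pixel_feat)

# Fuse semantic, box, and (pooled) mask representations into regional representations.
semantic = rng.standard_normal((N, C))
boxes = rng.random((N, 4))
W_box = rng.standard_normal((4, C))            # stand-in for PE + linear projection
W_mask = rng.standard_normal((Hm * Wm, C))     # stand-in for pooling + linear projection
regions = semantic + boxes @ W_box + masks.reshape(N, -1) @ W_mask
```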
To reduce the computational cost, we predict an objectness score for each proposal and select the top-$K$ proposals as the final regional representations, where $K$ is set to 200 by default in Uni-Perceiver v2.
Loss Function. In non-localization tasks such as image classification, the supervision is applied only to the final predictions of the unified decoder as in Eq. 7, and there is no special supervision for the proposal generation of the image encoder. In localization tasks such as object detection, additional supervision is applied to train the region proposal network. Specifically, we adopt the contrastive query denoising of MaskDINO [li2022mask] for the training of the Transformer decoder. For better convergence of the region proposal network, we predict objectness, bounding box, and segmentation mask for each proposal at the output of the Transformer encoder and of each Transformer decoder layer, and detection losses with binary classification (i.e., predicting objectness instead of classes) are applied to each output as intermediate supervision.
Appendix B Implementation Details
Region Proposal Network. The hyper-parameters used in our region proposal network are listed in Tab. 7. These values mainly follow MaskDINO [li2022mask], with small modifications. The number of candidate object proposals (‘num_queries’ in Tab. 7) used to generate regional representations is 300 and 900 for the ResNet-50 backbone and the Swin backbones, respectively. To reduce the computation cost of the unified decoder, the region proposals are filtered by their objectness scores and only the top-$K$ region representations are selected as input for the unified decoder (‘topk_queries’ in Tab. 7). Moreover, to save computation, the point loss used in Mask2Former [mask2former] is adopted to calculate the mask loss, where a fixed number of points is sampled per mask.
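The point loss evaluates the mask loss only at a small set of sampled points rather than over every pixel. A minimal sketch with uniform point sampling (Mask2Former actually uses importance sampling, and the point count here is illustrative since the paper's value is not given in this excerpt):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, num_points = 256, 256, 1024              # num_points is illustrative

pred_logits = rng.standard_normal((H, W))      # predicted mask logits
target = (rng.random((H, W)) > 0.5).astype(float)

# Sample random point coordinates and compute binary cross-entropy only
# there, instead of over all H*W pixels.
ys = rng.integers(0, H, num_points)
xs = rng.integers(0, W, num_points)
p = 1.0 / (1.0 + np.exp(-pred_logits[ys, xs]))
t = target[ys, xs]
point_loss = -np.mean(t * np.log(p + 1e-8) + (1 - t) * np.log(1 - p + 1e-8))
```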
Unified Decoder. As for the Transformer-based unified decoder, a uniform drop rate of 0.1 for stochastic depth is used across all layers. Unlike the Uni-Perceiver series [zhu2022uni_p, zhu2022uni], the layer-scale technique [touvron2021cait] is not enabled, since no instability is observed during the training of the 6-layer unified decoder. In addition, when Conditional MoE is employed in the unified decoder, the number of experts in each layer is set to 8.
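Stochastic depth randomly skips a layer's residual branch during training with the given drop rate. A scalar sketch of the per-layer rule (real inputs are tensors; the helper name is ours):

```python
import random

def stochastic_depth(x, residual, drop_rate=0.1, training=True, rng=random.Random(0)):
    """Drop the residual branch with probability drop_rate during training;
    at inference, scale it by the keep probability instead."""
    keep = 1.0 - drop_rate
    if not training:
        return x + keep * residual
    return x + residual if rng.random() < keep else x
```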
Data augmentation. For all tasks except object detection and instance segmentation, we apply data augmentation techniques similar to Uni-Perceiver [zhu2022uni_p]; the input image resolution is set separately for the Swin backbones and the ResNet-50 backbone. For object detection and instance segmentation, we first randomly resize the input image with its shorter side between 200 and 1800 pixels and its longer side at most 2400 pixels, and then crop the image to a fixed size during training. For evaluation, the shorter side is set to 1400 and the maximum longer side is set to 1600.
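The resize rule above (random shorter side within a range, longer side capped) can be sketched as follows; the function name and default values mirror the numbers in the text, and the implementation is an assumption of the standard shortest-edge resize:

```python
import random

def random_resize(h, w, short_min=200, short_max=1800, long_max=2400,
                  rng=random.Random(0)):
    """Pick a target shorter side, then cap the scale so the longer side
    does not exceed long_max (illustrative version of the augmentation)."""
    target_short = rng.randint(short_min, short_max)
    scale = target_short / min(h, w)
    if max(h, w) * scale > long_max:
        scale = long_max / max(h, w)
    return round(h * scale), round(w * scale)

print(random_resize(480, 640))
```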
Others. Tab. 8 lists the batch size, sampling weight, and scaling factor for each task and dataset in the joint training.
| Task | Dataset | #Data | Batch size / GPU | Sampling weight | Scaling factor |
| --- | --- | --- | --- | --- | --- |
| Image Classification | ImageNet-1k [deng2009imagenet] | 1.28M | 28 | 0.1 | 1.0 |
| Object Detection & Instance Segmentation | COCO [lin2014microsoft] | 118K | 1 | 0.25 | 1.0 |
| Masked Language Modeling | Books&Wiki [zhu2015aligning] | - | 256 | 0.05 | 0.5 |
| Image Captioning | YFCC [yfcc] | 14.8M | 24 | 0.09831 | 0.16385 |
| Image Captioning | Visual Genome [krishna2017visual] | 108K | 24 | 0.02973 | 0.04955 |
| Image Captioning | COCO Caption [Chen2015MicrosoftCC] | 113K | 24 | 0.0192 | 0.032 |
| Image-Text Retrieval | YFCC [yfcc] | 14.8M | 28 | 0.09831 | 0.3277 |
| Image-Text Retrieval | Visual Genome [krishna2017visual] | 108K | 28 | 0.02973 | 0.0991 |
| Image-Text Retrieval | COCO Caption [Chen2015MicrosoftCC] | 113K | 28 | 0.0192 | 0.064 |
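One plausible way such sampling weights drive joint training is to draw one task/dataset per iteration in proportion to its weight. A hedged sketch using the weights from Tab. 8 (the dataset keys and the per-iteration sampling scheme are our assumptions for illustration):

```python
import random

# Sampling weights from Tab. 8, one entry per task/dataset pair.
weights = {
    "imagenet-1k": 0.1,
    "coco-det": 0.25,
    "books&wiki": 0.05,
    "yfcc-caption": 0.09831,
    "vg-caption": 0.02973,
    "coco-caption": 0.0192,
    "yfcc-retrieval": 0.09831,
    "vg-retrieval": 0.02973,
    "coco-retrieval": 0.0192,
}

rng = random.Random(0)

def sample_dataset():
    """Draw one dataset for the next training iteration, proportional to its
    sampling weight (random.choices normalizes the weights internally)."""
    names, w = zip(*weights.items())
    return rng.choices(names, weights=w, k=1)[0]

draws = [sample_dataset() for _ in range(1000)]
```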
Appendix C Detection on Novel Categories
Thanks to the general task modeling of Uni-Perceiver v2, different tasks can borrow knowledge from each other. For example, the object detection task can generalize to novel categories from the image classification dataset. Fig. 4 shows detection results of Uni-Perceiver v2 on images from the ImageNet-1k validation set whose categories do not exist in the COCO dataset. This demonstrates the generalization ability of Uni-Perceiver v2 and indicates the benefit of general task modeling.
Appendix D Licenses of Datasets
Replicate Toronto BookCorpus is open-source and licensed under GNU GPL, Version 3.
Wikipedia Most of Wikipedia’s text is co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA) and the GNU Free Documentation License (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts). Some text has been imported only under CC BY-SA and CC BY-SA-compatible license and cannot be reused under GFDL.
YFCC [yfcc] All the photos and videos provided in the YFCC dataset are licensed under one of the Creative Commons copyright licenses.
Visual Genome [krishna2017visual] is licensed under a Creative Commons Attribution 4.0 International License [vgterms].