
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an improved optimizer to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with the commonly-recognized strong baselines that require tasks-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.


1 Introduction

Figure 1: Comparison of foundation models and Uni-Perceiver v2. In existing foundation models, task-specific decoders are attached to the image encoder and text encoder, and both are tuned separately for each task during task-specific fine-tuning, so the total number of parameters required for adaptation grows with the number of visual and linguistic tasks. By contrast, Uni-Perceiver v2 shares all parameters across downstream tasks through a general decoder, and no task-specific fine-tuning is involved. Unlike previous generalist models, our method can also effectively handle pillar tasks such as image classification, object detection, instance segmentation, and image-text retrieval.

Learning a general perception model that can handle various modalities and tasks is widely regarded as an important step towards artificial general intelligence. Due to its difficulty, many works (e.g., Florence [yuan2021florence], CoCa [yu2022coca], BEiT-3 [wang2022image]), also known as foundation models [bommasani2021opportunities], instead focus on a fallback solution: learning a general representation encoder that can be adapted (e.g., fine-tuned) to various downstream tasks. By performing large-scale pre-training on massive multi-modal task-agnostic data, these works have demonstrated their superiority by pushing the state-of-the-art results on a broad range of tasks, including single-modal tasks (e.g., image classification and object detection) and cross-modal tasks (e.g., image captioning and image retrieval).

Despite this success, there is still a considerable gap between foundation models and the goal of general perception modeling. Foundation models focus only on general representation learning, while task modeling is neglected: the traditional task-specific fine-tuning paradigm is still used (see Fig. 1). This significantly increases the marginal cost of adapting pre-trained models to various downstream tasks, making it difficult to meet the rapidly growing demands of diverse downstream tasks and scenarios. Such a task-specific fine-tuning paradigm is inconsistent with the goal of general perception modeling.

Instead of performing task-specific fine-tuning, generalist models process different tasks with a shared architecture and parameters, which is aligned with the goal of general perception modeling. This not only reduces the cost of handling diverse tasks but also enables task collaboration. Most existing attempts at generalist models are sequence-to-sequence (seq2seq) models [pix2seqv2, ofa, unifiedio, alayrac2022flamingo, gpv1, gpv2, gato, unitab]. However, these attempts are inadequate in both versatility and performance: (1) some pillar vision and vision-language tasks listed in Tab. 1 cannot be handled, e.g., image-text retrieval, object detection, and instance segmentation; (2) their accuracy and inference speed still lag significantly behind state-of-the-art task-specific methods. Another line of research, Uni-Perceivers [zhu2022uni, zhu2022uni_p], builds generalist models supporting both generation and non-generation tasks. Nevertheless, they still cannot handle many vital tasks such as detection and segmentation.

To develop generalist models with better versatility and performance, our core idea is to encode images as general region proposals consisting of the semantic, bounding box and segmentation mask representations. Compared with previous methods where images are represented as non-overlapping patches, this design makes our localization modeling more expressive and flexible. This explicit utilization of localization clues not only greatly reduces the difficulty of handling localization tasks such as image detection and segmentation, but also provides richer features for non-localization tasks, thus enabling more general task modeling and better performance.

In this paper, we propose Uni-Perceiver v2 as a generalist model capable of handling major large-scale vision and vision-language tasks as listed in Tab. 1. Specifically, images are encoded as a concatenation of global and regional representations via a region proposal network, while texts are encoded via a Transformer-based language model. Both the image and text encoders can benefit from off-the-shelf pre-trained models, which reduces the demand for training data and resources and ensures performance. The encoded representations are transformed by a shared modality-agnostic Transformer [vaswani2017attention] network to obtain the decoded representations. Following Uni-Perceivers [zhu2022uni_p, zhu2022uni], different tasks are formulated as a unified maximum likelihood estimation problem and are jointly learned to enable general task adaptation. We further propose an improved optimizer named MT-AdamW to ensure stable multi-task learning with an unmixed sampling strategy which only samples one task for all GPUs per iteration. This is very helpful for tasks requiring large batch size training.

Uni-Perceiver v2 is the first generalist model achieving competitive results on major large-scale vision and vision-language tasks, including object detection, instance segmentation, image classification, image captioning, and image-text retrieval; only image generation remains unverified due to limited computational resources. After being jointly trained on various tasks, it can directly handle a broad range of tasks without any task-specific adaptation, achieving state-of-the-art performance among existing generalist models. Our contributions are summarized as:

  • We propose Uni-Perceiver v2, which is the first generalist model capable of handling both localization and non-localization tasks with competitive performance. The general region proposal encoding of images brings more flexible and expressive localization modeling.

  • To improve the effectiveness of multi-task learning, we adopt an unmixed sampling strategy to enable large batch-size training and develop an improved optimizer named MT-AdamW to mitigate the instability in gradients.

  • Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Without any task-specific adaptation, Uni-Perceiver v2 achieves competitive performance on a broad range of downstream tasks compared with commonly-recognized strong baselines that require task-specific fine-tuning, demonstrating its strong ability of general task modeling.

Retrieval: Image-text retrieval
Classification: Image classification; Region categorization; Situation recognition
Localization: Object detection; Key point detection; Pose estimation; Referring expression grounding; Human-object interaction; Relation detection; Optical character recognition; Object localization
Mask Prediction: Instance segmentation; Semantic segmentation; Panoptic segmentation
Image Generation: Image synthesis; Image inpainting; Segment-based image generation; Style transfer; Depth estimation; Surface normal estimation; Image infilling; Image super-resolution
Image to Text: Image captioning; Visual question answering; Region captioning; Grounded VQA; Grounded captioning; Visual commonsense reasoning
Table 1: Categories of mainstream vision and vision-language tasks. The pillar task of each category is its most representative task, from which the other tasks can be derived: image-text retrieval, image classification, object detection, instance segmentation, image synthesis, and image captioning. Uni-Perceiver v2 is able to effectively handle all of these pillar tasks except image synthesis, which has not been verified due to limited computational resources.

2 Related Work

Foundation Vision Models  are “designed to be adapted (e.g., fine-tuned) to various downstream tasks by pre-training on broad data at scale” [bommasani2021opportunities]. Such large-scale pre-trained vision models have shown effectiveness in enriching data encoding capacity, alleviating data hunger, and improving the performance of downstream tasks.

Image classification on ImageNet-1k [deng2009imagenet] has been the mainstream pre-training paradigm for a long period. However, as the model size grows, larger annotated datasets are required to avoid over-fitting in pre-training, such as ImageNet-21k [deng2009imagenet], Instagram-1B [mahajan2018exploring], JFT-300M [sun2017revisiting] and JFT-3B [zhai2022scaling]. Inspired by the success of linguistic pre-training on massive web-crawled text, CLIP [clip] and ALIGN [align] have begun to focus on multi-modal contrastive pre-training on web-scale noisy image-text pairs to learn aligned image and text representations. SimVLM [wang2021simvlm] employs the multi-modal sequence generation task for pre-training. FLAVA [singh2021flava] combines contrastive and generative pre-training to handle both unimodal and multimodal tasks. UniCL [yang2022unified] and CoCa [yu2022coca] jointly use human-annotated and web-crawled data. Florence [yuan2021florence] and INTERN [shao2021intern] increase the scale and diversity of pre-training data to enhance the representation capability. OmniVL [wang2022omnivl] proposes to incorporate both image-language and video-language tasks in its pre-training. BEiT-3 [wang2022image] unifies pre-training objectives for different modalities as a single masked data modeling task, achieving state-of-the-art results on a wide range of downstream tasks.

These works on foundation models only focus on general representation learning, while neglecting task modeling. When adapting them to downstream tasks, the traditional task-specific fine-tuning paradigm is still utilized, which is inconsistent with the goal of general perception modeling. Meanwhile, with the rapidly growing demands of diverse tasks and scenarios, the task-specific fine-tuning paradigm would result in a prohibitive marginal cost for data collection, data annotation, model training, and model storage.

Generalist models  handle various tasks with shared architecture and parameters, a goal long pursued by the machine learning community. Recently, inspired by the success of sequence-to-sequence (seq2seq) models in the NLP field [radford2018improving], OFA [ofa], Flamingo [alayrac2022flamingo], and GIT [wang2022git] propose to model various tasks as sequence generation. Unified-IO [unifiedio], Pix2Seq v2 [pix2seqv2], and UniTab [unitab] further develop this method to support more tasks by introducing discrete coordinate tokens, so that location information can be encoded and decoded by the unified models. Beyond that, Gato [gato] succeeds in unifying reinforcement learning tasks into the seq2seq framework. GPV [gpv1] also builds a general-purpose vision system by adding a seq2seq module on top of a DETR [carion2020end]-based visual encoder.

However, these methods with the seq2seq formulation are still inadequate in both versatility and performance: (1) They cannot handle some core vision tasks, e.g., image-text retrieval, object detection, and instance segmentation. Although Pix2Seq v2 [pix2seqv2] includes detection and instance segmentation, its performance and inference speed still lag significantly behind state-of-the-art task-specific methods [zhang2022dino, li2022mask]; (2) The non-parallel auto-regressive decoding leads to slow inference. For example, image classification requires calculating and comparing the cumulative probabilities of all category names conditioned on the given image; (3) They also suffer from the task-interference issue in multi-task learning, resulting in performance degradation compared with task-specific models.

Alternatively, Uni-Perceivers [zhu2022uni_p, zhu2022uni] formulate different tasks as finding the maximum likelihood target for each input through the representation similarity regardless of their modality, making it possible to support both generation and non-generation tasks. Nevertheless, they still cannot handle image detection and segmentation tasks.

3 Revisiting Uni-Perceivers

Unified Modeling of Perception Tasks.  Uni-Perceiver [zhu2022uni_p] proposes to reformulate different tasks as a unified maximum likelihood estimation problem. Specifically, each task is defined with a set of inputs and a set of candidate targets from arbitrary combinations of modalities. The inputs and targets are first encoded with modality-specific tokenizers and linear projections. The encoded representations are then transformed by a modality-agnostic decoder whose parameters are shared across tasks. Given an input, the unified task objective is defined as finding the target with the maximum likelihood given that input.

Mitigating Task Interference.  Multi-task learning with fully shared parameters can introduce interference between different tasks. Uni-Perceiver-MoE [zhu2022uni] proposes Conditional MoEs to address this task-interference issue. Specifically, for each input token, a routing decision is computed according to a specific routing strategy, which sparsely activates a small portion of experts to process that token. The output for the token is the linearly weighted combination of the selected experts given by the routing decision. Conditional MoEs mitigate the interference issue by allowing conflicting modalities and tasks to use separate parameters, without introducing any task-specific modules.
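To make the routing concrete, below is a minimal PyTorch sketch of a Conditional-MoE feed-forward layer with sparse top-k routing, in the spirit of the description above. The class name, the gating-by-condition interface, and all hyper-parameter values are illustrative assumptions rather than the Uni-Perceiver-MoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalMoEFFN(nn.Module):
    """Minimal sketch of a Conditional-MoE feed-forward layer.

    Each token is routed to a small number of expert FFNs; the layer output is
    the routing-weighted sum of the selected experts' outputs. Names and
    hyper-parameters are illustrative, not the paper's exact code.
    """

    def __init__(self, d_model=768, d_ffn=3072, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        ])
        # The gate consumes a routing condition (e.g., an attribute embedding
        # describing the token's modality/role), not necessarily the token itself.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, tokens, routing_condition):
        # tokens, routing_condition: (batch, seq_len, d_model)
        logits = self.gate(routing_condition)                # (B, L, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # sparse routing decision
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (indices[..., slot] == e)              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(tokens[mask])
        return out
```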

Limitations.  Although Uni-Perceivers aim to process different tasks with a unified architecture, they fail to handle detection and segmentation tasks due to the lack of localization information in their encoded features. Meanwhile, Uni-Perceivers do not integrate off-the-shelf encoder models, so they cannot benefit from existing large-scale pre-trained encoders. This potentially increases their demand for pre-training data and resources and limits their performance.

4 Method

4.1 Encoding Images as General Region Proposals

Most existing generalist models [zhu2022uni_p, zhu2022uni] represent images as non-overlapping patches with fixed sizes. This design is rather coarse and limited in modeling objects of varying sizes and shapes in images, making it difficult to handle localization tasks such as detection and segmentation.

In order to enable more expressive and flexible localization modeling, we propose to encode the input image as a sequence of general region proposals. Specifically, given an input image $I$ with height $H$ and width $W$, a network $\mathrm{Enc}^v$ is employed to encode the image as the concatenation of global and regional representations,

$$ \mathbf{f}^v = \mathrm{Enc}^v(I) = \mathrm{concat}\big(\mathbf{f}^{\mathrm{global}},\ \mathbf{f}^{\mathrm{region}}\big), \tag{1} $$

where $\mathbf{f}^{\mathrm{global}}$ are the global representations of the whole image, and $\mathbf{f}^{\mathrm{region}} = \{\mathbf{f}^{\mathrm{region}}_i\}_{i=1}^{N}$ are the regional representations of $N$ candidate object proposals in the image.

Following the common practice in localization tasks, an image backbone network (e.g., ResNet [he2016deep]) is first employed to extract multi-scale feature maps $\{C_l\}_{l=1}^{L}$, where $L$ is the number of feature scales (e.g., $L=4$).

Regional Representations.  A Transformer [vaswani2017attention]-based region proposal network is applied on top of the multi-scale feature maps to extract a set of $N$ candidate object proposals $\{(\mathbf{f}^{\mathrm{sem}}_i, \mathbf{b}_i, \mathbf{m}_i)\}_{i=1}^{N}$, where $\mathbf{f}^{\mathrm{sem}}_i$, $\mathbf{b}_i$, and $\mathbf{m}_i$ are the semantic, bounding box, and segmentation mask representations of the $i$-th proposal, respectively. The region proposal network is similar to MaskDINO [li2022mask], but only considers foreground-background binary classification. See Appendix for detailed implementation. These three representations are then fused as the regional representation,

$$ \mathbf{f}^{\mathrm{region}}_i = \mathbf{f}^{\mathrm{sem}}_i + \mathrm{Proj}\big(\mathrm{PE}(\mathbf{b}_i)\big) + \mathrm{Proj}\big(\mathrm{Pool}(\mathbf{m}_i)\big), \tag{2} $$

where $\mathrm{PE}(\cdot)$ denotes the positional encoding of box coordinates, $\mathrm{Pool}(\cdot)$ uses adaptive average pooling to scale the mask predictions to a fixed spatial size, and both terms are followed by linear projections $\mathrm{Proj}(\cdot)$ to match the feature dimension.
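A minimal sketch of the fusion in Eq. (2) is given below, assuming sine positional encodings for normalized box coordinates and a fixed pooled mask size; the module name, dimensions, and encoding details are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionalFusion(nn.Module):
    """Sketch of Eq. (2): fuse semantic, box, and mask cues into one regional feature."""

    def __init__(self, d_model=768, pe_dim=64, mask_pool_size=16):
        super().__init__()
        self.pe_dim = pe_dim
        self.mask_pool_size = mask_pool_size
        self.box_proj = nn.Linear(4 * pe_dim, d_model)            # project box PE
        self.mask_proj = nn.Linear(mask_pool_size ** 2, d_model)  # project pooled mask

    def sine_pe(self, coords):
        # coords: (N, 4) normalized box coordinates in [0, 1]
        freqs = torch.arange(self.pe_dim // 2, device=coords.device, dtype=coords.dtype)
        freqs = 10000 ** (-2 * freqs / self.pe_dim)               # (pe_dim/2,)
        angles = coords.unsqueeze(-1) * freqs                     # (N, 4, pe_dim/2)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)  # (N, 4*pe_dim)

    def forward(self, f_sem, boxes, masks):
        # f_sem: (N, d_model) semantic features; boxes: (N, 4); masks: (N, H, W) logits
        pooled = F.adaptive_avg_pool2d(masks.unsqueeze(1), self.mask_pool_size)  # (N, 1, S, S)
        pooled = pooled.flatten(1)                                               # (N, S*S)
        return f_sem + self.box_proj(self.sine_pe(boxes)) + self.mask_proj(pooled)
```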

Global Representations.  The global representations are extracted from the last-scale feature map $C_L$ with height $h$ and width $w$. Several instances of parameterized Attention Pooling [clip] are employed to extract global features, and the pooled features are concatenated with the flattened feature map to obtain the global representations,

$$ \mathbf{f}^{\mathrm{global}} = \mathrm{concat}\big(\mathrm{AttnPool}(C_L),\ \mathrm{flatten}(C_L)\big). \tag{3} $$
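For illustration, the sketch below approximates the attention pooling and concatenation in Eq. (3) with a standard multi-head attention layer driven by a few learnable query tokens; this is an assumption about the pooling mechanics, not the CLIP or Uni-Perceiver v2 implementation.

```python
import torch
import torch.nn as nn

class GlobalAttnPool(nn.Module):
    """Sketch of Eq. (3): learnable queries attention-pool the last-scale feature map,
    and the pooled tokens are concatenated with the flattened map."""

    def __init__(self, d_model=256, num_queries=1, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, feat_map):                      # feat_map: (B, d, h, w)
        b, d, h, w = feat_map.shape
        tokens = feat_map.flatten(2).transpose(1, 2)  # (B, h*w, d) flattened feature map
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)      # (B, num_queries, d) pooled features
        return torch.cat([pooled, tokens], dim=1)     # global representations
```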

4.2 Encoding Text with Language Models

A Transformer [vaswani2017attention]-based language model is used to encode textual data, such as category names in classification tasks, image descriptions in image-text retrieval tasks, and the vocabulary in image captioning tasks. Specifically, a BPE tokenizer [sennrich2015neural] tokenizes the input text $T$ into a sequence of word embeddings, and a Transformer encoder is employed to extract the text feature sequence,

$$ \mathbf{f}^t = \mathrm{Enc}^t(T) = \{\mathbf{f}^t_j\}_{j=1}^{M}, \tag{4} $$

where $\mathbf{f}^t_j$ is the encoded feature of the $j$-th word and $M$ is the sequence length. In our implementation, we use a pre-trained RoBERTa [liu2019roberta] as the text encoder, which is jointly tuned with the whole network.
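As a concrete (hedged) example of this text-encoding path, the snippet below uses the Hugging Face `roberta-base` checkpoint as a stand-in for the paper's pre-trained RoBERTa; the checkpoint choice and the use of the last hidden states as the text feature sequence are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Sketch of the text-encoding path in Sec. 4.2, using "roberta-base" as a stand-in.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # BPE tokenizer
text_encoder = AutoModel.from_pretrained("roberta-base")    # jointly tuned in the paper

texts = ["a dog catching a frisbee", "two people riding bicycles"]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():                                        # no_grad only for this demo
    f_t = text_encoder(**batch).last_hidden_state            # (batch, seq_len, hidden)

print(f_t.shape)  # one feature vector per sub-word token
```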

4.3 General Task Adaptation

We follow Uni-Perceivers [zhu2022uni_p, zhu2022uni] to formulate different tasks as a unified maximum likelihood estimation problem. Given an input $x$ and a candidate target set $\mathcal{Y}$, the task objective is defined as finding the target with the maximum likelihood,

$$ \hat{y} = \arg\max_{y \in \mathcal{Y}} P(y \mid x), \tag{5} $$

where the likelihood $P(y \mid x)$ is estimated from the cosine similarity between the representations of $x$ and $y$,

$$ P(y \mid x) \propto \exp\Big(\cos\big(\mathrm{Dec}(\mathrm{Enc}(x)),\ \mathrm{Dec}(\mathrm{Enc}(y))\big) / \tau\Big), \tag{6} $$

where $\mathrm{Enc}(\cdot)$ denotes the modality-specific encoders $\mathrm{Enc}^v$ and $\mathrm{Enc}^t$ introduced in Sec. 4.1 and 4.2, respectively, $\mathrm{Dec}(\cdot)$ is a modality-agnostic Transformer [vaswani2017attention] network shared across different tasks, and $\tau$ is a learnable temperature parameter.

Depending on task requirements, the modality-specific encoded representation of the input can be an image feature sequence $\mathbf{f}^v$, a text feature sequence $\mathbf{f}^t$, or their concatenation, with an additional <SPE> token inserted at the beginning. The encoded representation of the target is constructed in the same way.
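The scoring in Eqs. (5)-(6) can be sketched as follows, assuming the decoded feature of the leading <SPE> token is used as the sequence-level representation; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def unified_mle_scores(dec_input, dec_targets, temperature):
    """Sketch of Eqs. (5)-(6): score candidate targets for one input.

    dec_input:   (d,)    decoded representation of the input sequence
                         (e.g., the feature of its leading <SPE> token).
    dec_targets: (C, d)  decoded representations of C candidate targets.
    Returns the likelihood over candidates and the predicted index.
    """
    sims = F.cosine_similarity(dec_input.unsqueeze(0), dec_targets, dim=-1)  # (C,)
    probs = F.softmax(sims / temperature, dim=-1)
    return probs, probs.argmax().item()

# Toy usage: 1000 class-name embeddings as candidate targets.
probs, pred = unified_mle_scores(torch.randn(768), torch.randn(1000, 768),
                                 temperature=0.05)
```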

To obtain general task modeling capability, Uni-Perceiver v2 conducts multi-task learning on various uni-modal and multi-modal tasks. Denote the set of tasks as $\{(\mathcal{X}_k, \mathcal{Y}_k)\}_{k=1}^{K}$, where $\mathcal{X}_k$ and $\mathcal{Y}_k$ are the input set and target set of the $k$-th task, respectively. The training loss is

$$ \mathcal{L} = \sum_{k=1}^{K} s_k\, w_k\, \mathcal{L}_k, \qquad \mathcal{L}_k = \mathbb{E}_{(x, y) \sim (\mathcal{X}_k, \mathcal{Y}_k)} \big[ -\log P(y \mid x) \big], \tag{7} $$

where $s_k$ and $w_k$ denote the sampling ratio and loss weight of the $k$-th task, respectively. The sampling ratios are normalized such that $\sum_{k=1}^{K} s_k = 1$. We refer to Sec. 4.4 for a detailed discussion of the sampling strategy. To mitigate task interference in multi-task training, we follow Uni-Perceiver-MoE [zhu2022uni] and employ Conditional MoEs with the attribute-level routing strategy.

Tasks with Localization.  Uni-Perceiver v2 can perform localization tasks such as object detection and instance segmentation by decoding the regional representations. Specifically, for each region proposal, its output feature from the unified decoder is compared with class embeddings to obtain the class prediction as in Eq. (5). The corresponding bounding box and segmentation mask serve as the localization predictions.

Tasks without Localization.  Uni-Perceiver v2 can also handle tasks that do not need localization predictions, e.g., image classification, image captioning, and image-text retrieval. It follows a similar formulation to Uni-Perceiver for these tasks, with two major differences: (1) more expressive and flexible localization clues for images, which better facilitate these tasks; (2) both the image and text encoders can leverage off-the-shelf modality-specific pre-trained models, leading to better performance.

4.4 Sampling Strategy and Improved Optimization

Optimizing generalist models follows the paradigm of multi-task learning, which performs joint training on data from different tasks. Current methods usually mix all tasks in one training iteration [zhu2022uni_p, ofa, unifiedio]. Such a mixed sampling strategy limits the batch size of each task, which can be detrimental to tasks that benefit from large batch-size training (e.g., image-text retrieval).

A straightforward solution is to sample only one task per iteration, which we refer to as the unmixed sampling strategy. It achieves the largest possible training batch size per task. However, when different iterations sample different tasks, the gradients can vary greatly due to the differences in data and tasks, which may cause instability in multi-task learning and deteriorate performance.
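A minimal sketch of the unmixed sampling strategy is shown below, using the task sampling ratios from Tab. 8; keeping all data-parallel workers in sync via a shared random seed is an illustrative choice rather than the paper's stated implementation (a rank-0 broadcast would work as well).

```python
import random

def make_unmixed_task_sampler(sampling_ratios, seed=0):
    """Every iteration, all workers draw the *same* single task with probability
    proportional to its sampling ratio s_k (unmixed sampling strategy)."""
    tasks = list(sampling_ratios.keys())
    weights = [sampling_ratios[t] for t in tasks]
    rng = random.Random(seed)                    # identical sequence on every worker
    while True:
        yield rng.choices(tasks, weights=weights, k=1)[0]

# Sampling ratios taken from Tab. 8.
sampler = make_unmixed_task_sampler(
    {"detection": 0.25, "classification": 0.1, "captioning": 0.3,
     "retrieval": 0.3, "mlm": 0.05})
task_for_iteration = next(sampler)               # e.g. "captioning" on all GPUs
```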

To mitigate the instability of the unmixed sampling strategy, we propose an improved optimizer for multi-task training, named MT-AdamW. The core idea is to balance the gradient of each task by normalizing the gradient at each iteration and compensating it according to the task sampling ratio.

Suppose the $k$-th task is sampled at timestep $t$. The vanilla AdamW [loshchilov2017decoupled] is modified to MT-AdamW by updating the parameters $\theta$ as follows (weight decay and bias corrections are omitted for simplicity):

$$ g_t = \nabla_\theta \mathcal{L}_k(\theta_{t-1}), \qquad \hat{g}_t = w_k \cdot \frac{g_t}{\lVert g_t \rVert}, $$
$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, \frac{\hat{g}_t}{s_k}, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, \frac{\hat{g}_t^2}{s_k}, $$
$$ \theta_t = \theta_{t-1} - \eta \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}, $$

where $\mathcal{L}_k$ is the loss function of the sampled $k$-th task at timestep $t$, and $\eta$ is the learning rate. The original task gradient $g_t$ is first normalized to stabilize training, and the scaling factor $w_k$ serves as the loss weight of the sampled task. The trimmed gradient $\hat{g}_t$ is then used to estimate the first moment $m_t$ and second moment $v_t$ of the gradients in a moving-average manner. To further decouple the gradient contribution of each task from its sampling ratio, a task-specific compensation coefficient $1/s_k$ is used to unbias the estimates $m_t$ and $v_t$. In practice, if all tasks are expected to contribute equally, all scaling factors $w_k$ can be set to the same value.
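The following PyTorch sketch mirrors the update rule reconstructed above (single step, decoupled weight decay, bias correction omitted); it is not the authors' optimizer code, and the global-norm normalization and the 1/s_k compensation of the moments are assumptions carried over from that reconstruction. In a training loop, one would call `loss.backward()` for the sampled task k and then invoke this step with that task's w_k and s_k.

```python
import torch

@torch.no_grad()
def mt_adamw_step(params, task_w, task_s, state, lr=1e-4, betas=(0.9, 0.999),
                  eps=1e-8, weight_decay=1e-4):
    """One MT-AdamW-style step for the task sampled this iteration.

    task_w: loss weight w_k of the sampled task.
    task_s: sampling ratio s_k of the sampled task (used to unbias the moments).
    """
    params = [p for p in params if p.grad is not None]
    if not params:
        return
    # Global norm of this iteration's (single-task) gradient.
    gnorm = torch.sqrt(sum((p.grad ** 2).sum() for p in params)) + 1e-12
    for p in params:
        st = state.setdefault(p, {"m": torch.zeros_like(p), "v": torch.zeros_like(p)})
        g_hat = task_w * p.grad / gnorm                     # normalized, task-weighted gradient
        # Moments compensated by 1/s_k so rarely sampled tasks are not under-represented.
        st["m"].mul_(betas[0]).add_(g_hat, alpha=(1 - betas[0]) / task_s)
        st["v"].mul_(betas[1]).addcmul_(g_hat, g_hat, value=(1 - betas[1]) / task_s)
        p.mul_(1 - lr * weight_decay)                       # decoupled weight decay (AdamW-style)
        p.addcdiv_(st["m"], st["v"].sqrt().add_(eps), value=-lr)
```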

5 Experiments

5.1 Datasets

Uni-Perceiver v2 performs multi-task training on various tasks and publicly available datasets to achieve its general task modeling capability. It uses similar datasets to Uni-Perceiver [zhu2022uni_p]. Specifically, the image classification task is trained on the ImageNet-1k [deng2009imagenet] dataset. For object detection and instance segmentation, COCO [lin2014microsoft] is used for training. For image captioning and image-text retrieval, we use a combination of image-text-pair datasets: SBU Captions [ordonez2011im2text], Visual Genome [krishna2017visual], COCO Caption [Chen2015MicrosoftCC], CC3M [sharma2018conceptual], CC12M [changpinyo2021cc12m] and YFCC [yfcc]. We also add a language modeling task during training, which is trained on BookCorpus [zhu2015aligning] and English Wikipedia (Books&Wiki).

During evaluation, we evaluate generalist models on the most representative datasets for the pillar vision and vision-language tasks listed in Tab. 1. Specifically, ImageNet-1k [deng2009imagenet] and COCO Caption [Chen2015MicrosoftCC] are used to evaluate image classification and image captioning, respectively. For image-text retrieval, COCO Caption and Flickr30k [plummer2015flickr30k] are used; note that Flickr30k is not involved in training. For object detection and instance segmentation, COCO [lin2014microsoft] is used for evaluation. We list the licenses of all datasets in the Appendix.

5.2 Implementation Details

We implement three Uni-Perceiver v2 variants with different backbones, i.e., ResNet-50 [he2016deep], Swin-Base [liu2021swin], and Swin-Large. ResNet-50 is pre-trained on ImageNet-1k, and Swin-Base is pre-trained on ImageNet-21k. Swin-Large is first pre-trained on ImageNet-21k and then trained on the detection task with Objects365 [shao2019objects365]. The number of feature scales is set to 4 for all models. A Transformer [vaswani2017attention]-based region proposal network is used to generate general region proposals; its architecture and settings mainly follow Mask DINO [li2022mask], except that we replace all multi-category classifiers with binary classifiers. In addition, a fixed number of parameterized attention-pooling instances is used to extract global features. We choose the pre-trained RoBERTa [liu2019roberta] as the text encoder, which is jointly tuned with the whole network. The unified decoder is also a Transformer-based network, whose parameters are randomly initialized and optimized from scratch. Its architecture follows the setting of the BERT [devlin2018bert] model, but it consists of only 6 Transformer layers. To mitigate the task-interference issue in multi-task learning, we also employ the attribute-level Conditional MoE [zhu2022uni] in all FFN layers of the unified decoder. Please refer to the Appendix for more details.

Unless stated otherwise, we adopt the unmixed sampling strategy, which samples only one task for all GPUs per iteration. The MT-AdamW optimizer with a base learning rate of 0.0001 and a weight decay of 0.0001 is used. The learning rate of the modality-specific encoders is multiplied by 0.1 since they have already been pre-trained. Uni-Perceiver v2 with the Swin-Base and Swin-Large backbones is trained for 200,000 iterations on 32 and 64 NVIDIA A100 GPUs, respectively, with the learning rate decayed at 160,000 iterations. Models with ResNet-50 are trained on 16 NVIDIA A100 GPUs for 150,000 iterations. For other training settings, please also refer to the Appendix.

5.3 Ablation Studies

In the following, we evaluate the key components of Uni-Perceiver v2 with the ResNet-50 backbone on four tasks, i.e., object detection on COCO, image classification on ImageNet-1k, image-text retrieval on COCO Caption, and image captioning on COCO Caption. The instance segmentation and language modeling tasks are not included to save training costs, and the YFCC dataset is also excluded from training. Note that the performance on these datasets is reported without any task-specific fine-tuning. Unless stated otherwise, a COCO-detection pre-trained ResNet-50 is used in the ablation studies to accelerate the convergence of multi-task training.

Effectiveness of Global and Regional Image Representations.  Uni-Perceiver v2 encodes images as the concatenation of global and regional representations. To evaluate their effectiveness on different tasks, we conduct experiments that employ different representations, i.e., using only global representations, using only regional representations, or using both. Results in Tab. 2 show that: (1) regional representations are crucial for both the captioning and retrieval tasks; we speculate that this is because region proposals provide localization clues that are helpful for both tasks. (2) Compared with regional-only representations, global representations deliver better results on image classification, indicating that global representations are important for image-level tasks. (3) Combining global and regional representations allows the two to complement each other and thus achieves the best overall results across tasks. Therefore, combining global and regional representations is taken as the default setting in our subsequent experiments.

Representation Types | COCO Detection (mAP) | ImageNet-1k Classification (Acc) | COCO Retrieval (I2T / T2I R@1) | COCO Caption (B@4)
Global | - | 76.8 | 46.3 / 34.6 | 28.8
Regional | 48.2 | 75.9 | 52.3 / 39.2 | 31.2
Global + Regional | 49.9 | 76.9 | 51.3 / 38.8 | 30.6
Table 2: Ablation of different representation types for general region proposals. Results are reported on object detection (mAP), image classification (Acc), image-text retrieval (I2T R@1 and T2I R@1), and image caption (BLEU-4).

Task Collaboration and Interference.  To analyze the collaboration and interference between different tasks, we conduct experiments that remove each task in turn from the joint training in Tab. 3. If removing one task improves (or degrades) the performance of another task, the removed task is detrimental (or beneficial) to the latter during joint training. For a fair comparison, Conditional MoEs are not employed except in the last experiment. The results show that, without MoEs, the other tasks have negative impacts on the training of image-text retrieval, whereas the image-text retrieval task promotes image captioning. The image classification task is also very helpful to image captioning, yet the reverse has no obvious effect. Note that all models employ an image encoder pre-trained on COCO detection, so all tasks can benefit from the pre-trained region proposal network. The results indicate that task interference indeed exists in the multi-task training of generalist models and is more common than task collaboration, underlining the importance of addressing the task-interference issue. By employing Conditional MoEs, the task interference is largely mitigated, resulting in improved results on all tasks.

Tasks | COCO Detection (mAP) | ImageNet-1k Classification (Acc) | COCO Retrieval (I2T / T2I R@1) | COCO Caption (B@4)
Single Task | 50.1 | 76.1 | 50.0 / 37.6 | 30.2
All Tasks | 49.8 | 76.3 | 46.0 / 34.7 | 28.9
w/o Detection | - | 76.6 (0.3) | 47.0 (1.0) / 34.6 (0.1) | 30.4 (0.5)
w/o Classification | 50.1 (0.3) | - | 51.6 (5.6) / 38.6 (3.9) | 25.9 (3.0)
w/o Retrieval | 49.5 (0.3) | 76.3 (0.0) | - | 27.4 (1.5)
w/o Captioning | 49.7 (0.1) | 76.3 (0.0) | 51.2 (5.2) / 38.3 (3.6) | -
All Tasks w/ MoE | 49.9 (0.1) | 76.9 (0.6) | 51.3 (5.3) / 38.8 (4.1) | 30.6 (0.7)
Table 3: Ablation of collaboration and interference between tasks. All experiments except the last row do not employ Conditional MoEs. Numbers in brackets are the absolute gaps to the “All Tasks” counterpart; gaps of at least 0.5 point indicate a clear effect.
Task Sampling | Gather Feature | MT-AdamW Optimizer | COCO Detection (mAP) | ImageNet-1k Classification (Acc) | COCO Retrieval (I2T / T2I R@1) | COCO Caption (B@4)
mixed | | | 49.6 | 76.7 | 40.1 / 31.9 | 27.6
unmixed | | | 49.2 | 76.6 | 39.8 / 30.9 | 27.5
unmixed | ✓ | | 49.3 | 76.8 | 50.4 / 37.3 | 27.6
unmixed | ✓ | ✓ | 49.9 | 76.9 | 51.3 / 38.8 | 30.6
Table 4: Ablation of sampling strategies and improved optimizer. “mixed” means mixing different tasks’ data in one iteration, while “unmixed” denotes that only one task’s data is sampled in one iteration. “Gather Feature” means that negative samples for retrieval tasks are collected synchronously across GPUs.

Sampling Strategy and Improved Optimization.  We evaluate the effectiveness of the unmixed sampling strategy (i.e., sampling one task per iteration) and the proposed MT-AdamW optimizer in Tab. 4. We observe that the vanilla unmixed sampling strategy, which computes the contrastive loss with only the samples on each GPU, has a slightly adverse effect on all tasks compared with the mixed sampling strategy. With the batch size increased by gathering features across all GPUs, the performance of the retrieval task is largely improved. Further introducing the MT-AdamW optimizer leads to more stable multi-task training and consistently improved performance across all tasks.

Pre-trained Method | Pre-trained Data | COCO Detection (mAP) | ImageNet-1k Classification (Acc) | COCO Retrieval (I2T / T2I R@1) | COCO Caption (B@4)
Supervised | IN-1k | 45.7 | 76.8 | 51.2 / 38.9 | 27.3
Supervised | IN-21k | 48.3 | 80.1 | 55.1 / 41.2 | 30.2
Supervised | IN-1k & COCO | 49.9 | 76.9 | 51.3 / 38.8 | 30.6
MoCo v2 | IN-1k | 48.3 | 75.0 | 54.8 / 40.5 | 29.6
CLIP | CLIP data | 47.2 | 73.8 | 55.3 / 41.3 | 32.0
Table 5: Ablation of different pre-trained image encoders.
Methods | #params | ImageNet-1k Acc | COCO Det. mAP | COCO Seg. mAP | COCO Cap. B@4 | COCO Cap. CIDEr | Text Retrieval COCO R@1 | Text Retrieval Flickr30k R@1 | Image Retrieval COCO R@1 | Image Retrieval Flickr30k R@1
Pix2Seq v2 [pix2seqv2] | 132M | - | 46.5 | 38.2 | 34.9 | - | - | - | - | -
UniTab [unitab] | 185M | - | - | - | - | 115.8 | - | - | - | -
Unified-IO LARGE [unifiedio] | 776M | 71.8 | - | - | - | - | - | - | - | -
Unified-IO XL [unifiedio] | 2.9B | 79.1 | - | - | - | 122.3 | - | - | - | -
Flamingo-3B [alayrac2022flamingo] | 3.2B | - | - | - | - | - | 65.9 | 89.3 | 48.0 | 79.5
Uni-Perceiver BASE [zhu2022uni_p] | 124M | 79.2 | - | - | 32.0 | - | 64.9 | 82.3 | 50.7 | 71.1
Uni-Perceiver LARGE [zhu2022uni_p] | 354M | 82.7 | - | - | 35.3 | - | 67.8 | 83.7 | 54.1 | 74.2
Uni-Perceiver-MoE BASE [zhu2022uni] | 167M | 80.3 | - | - | 33.2 | - | 64.6 | 82.1 | 51.6 | 72.4
Uni-Perceiver-MoE LARGE [zhu2022uni] | 505M | 83.4 | - | - | 35.5 | - | 67.9 | 83.6 | 55.3 | 75.9
Uni-Perceiver-v2 BASE | 308M | 86.3 | 58.6 | 50.6 | 35.4 | 116.9 | 71.8 | 88.1 | 55.6 | 73.8
Uni-Perceiver-v2 LARGE | 446M | 87.2 | 61.9 | 53.6 | 36.5 | 122.5 | 75.0 | 89.3 | 58.5 | 79.6
Gain over previous best | | (+3.8) | (+15.4) | (+15.4) | (+1.6) | (+0.2) | (+7.1) | (+0.0) | (+3.2) | (+0.1)
Table 6: Comparison of our Uni-Perceiver v2 to recent generalist models on the pillar vision and vision-language tasks listed in Tab. 1. Note that we only report results without any task-specific fine-tuning. Uni-Perceiver v2 is the first generalist model to support all of these pillar tasks and achieves competitive results without any task-specific adaptation. Generalist models that only report results with task-specific fine-tuning, e.g., OFA [ofa] and GIT [wang2022git], are not included. “#params” is the number of parameters required during model deployment for cross-modal tasks. The last row lists the gains of Uni-Perceiver-v2 LARGE over the previous best generalist result in each column.
Figure 2: Comparison with generalist models and commonly-recognized strong task-specific models on pillar vision and vision-language tasks. For generalist models including Uni-Perceiver v2, we only report the results without any task-specific fine-tuning. Uni-Perceiver v2 (Uni-P v2) is compared with competitive specialized models, i.e., Swin-large [liu2021swin], DINO [zhang2022dino], Mask DINO [li2022mask], OSCAR-L [li2020oscar] and ALIGN [align], and previous SoTA generalists, i.e., Uni-P-MoE-L [zhu2022uni], Pix2seq v2 [pix2seqv2], and Flamingo-3B [alayrac2022flamingo].

Effects of Different Image Encoder Pre-training.  By integrating off-the-shelf encoder models, Uni-Perceiver v2 is capable of leveraging existing large-scale pre-trained encoders. To analyze the effects of different pre-training, we employ different pre-trained models as image encoders. For supervised pre-training, we employ ResNet-50 pre-trained on ImageNet-1k, on ImageNet-21k, or consecutively pre-trained on ImageNet-1k and COCO. For weakly-supervised or unsupervised pre-training, we employ ResNet-50 pre-trained with MoCo v2 [chen2020mocov2] or CLIP [clip]. Tab. 5 demonstrates that different pre-training data and methods benefit different downstream tasks. Specifically, supervised pre-training shows the most obvious benefits on downstream tasks similar to the pre-training task, e.g., ImageNet-21k pre-training delivers the best results on ImageNet-1k classification. Besides, pre-training on large-scale supervised (ImageNet-21k), weakly-supervised, or unsupervised data (CLIP and MoCo v2) is more helpful for vision-language tasks such as image-text retrieval and image captioning, possibly thanks to more general representations.

5.4 Main Results

To further verify the effectiveness of Uni-Perceiver v2, we incorporate more powerful backbones including Swin-Base and Swin-Large, denoted as Uni-Perceiver-v2 BASE and Uni-Perceiver-v2 LARGE, respectively. In addition to the tasks included in the ablation studies, we also incorporate instance segmentation on COCO, language modeling on Books&Wiki, and image captioning / image-text retrieval on YFCC for larger-scale multi-task training.

Comparison with Existing Generalist Models.  We list the performance of Uni-Perceiver v2 and other generalist models on the pillar vision and vision-language tasks in Tab. 6. Since generalist models aim to process different tasks with shared architecture and parameters, task-specific fine-tuning would sacrifice the general modeling ability; we therefore report the performance of the shared models without any task-specific adaptation. Uni-Perceiver-v2 BASE outperforms all previous generalist models on all tasks except Flickr30k retrieval, even though some methods have many more parameters, e.g., Unified-IO XL and Flamingo-3B. The disadvantage on Flickr30k may be due to the use of private data by Flamingo-3B. Further scaling up to the Swin-Large backbone, Uni-Perceiver-v2 LARGE obtains the best performance on all tasks. Thanks to the flexibility of general region proposals, Uni-Perceiver v2 supports the most pillar tasks among generalist models and achieves consistently competitive results, demonstrating its superiority in both versatility and performance.

Comparison with Specialized Models.  We compare Uni-Perceiver v2 with commonly-recognized strong baseline models and previous SoTA generalist models on the pillar tasks in Fig. 2. The results show that Uni-Perceiver v2 significantly narrows the performance gap between generalist models and commonly-recognized strong baselines that require task-specific fine-tuning. It achieves comparable results across all tasks except retrieval on Flickr30k, which we suspect is because ALIGN [align] uses 1.8B private image-text pairs, far more than our training data; in contrast, Uni-Perceiver v2 uses only public data for training.

6 Conclusion

We propose Uni-Perceiver v2, which is the first generalist model that achieves competitive results on major large-scale vision and vision-language tasks. After being jointly trained on single-modal and multi-modal tasks, Uni-Perceiver v2 achieves competitive performance on a broad range of downstream tasks. As for limitations, our method has not been verified on image generation tasks due to limited computational resources.

References

Appendix A Architecture Details of the Image Encoder

As shown in Fig. 3, our Uni-Perceiver v2 consists of three main parts: the image encoder, the text encoder, and the unified decoder. In this section, we describe the architecture details of the image encoder.

Backbone Network.  Given an input image with height $H$ and width $W$, a backbone network (e.g., ResNet [he2016deep], Swin Transformer [liu2021swin]) is first employed to extract the multi-scale feature maps $\{C_l\}_{l=1}^{L}$, where $L$ is the number of feature scales and the spatial shapes of the feature maps are $\tfrac{H}{4} \times \tfrac{W}{4}$, $\tfrac{H}{8} \times \tfrac{W}{8}$, $\tfrac{H}{16} \times \tfrac{W}{16}$, and $\tfrac{H}{32} \times \tfrac{W}{32}$. The feature maps are transformed by convolutions to match the hidden dimension $d$ of the following Transformer-based region proposal network; the transformed feature maps are denoted as $\{C'_l\}_{l=1}^{L}$. An additional stride-2 convolution layer is applied on $C'_L$ to extract a smaller feature map $C'_{L+1}$.

Region Proposal Network. A Transformer-based region proposal network is applied on top of the multi-scale feature maps to generate regional representations. Specifically, in the 4-scale setting adopted by Uni-Perceiver v2, the input of the Transformer encoder consists of the transformed feature maps except the first scale $C'_1$, together with the additional smaller map $C'_{L+1}$. A deformable Transformer encoder [zhu2020deformable] is employed to extract multi-scale encoded features whose spatial shapes and dimensions are the same as the corresponding input features. To generate the region proposals, we apply a deformable Transformer decoder on the multi-scale encoded features. To construct the input object queries of the Transformer decoder, we predict the objectness and bounding box of each feature pixel in the encoded feature maps and select the top-scoring features based on their objectness. The selected features are added to randomly initialized object queries as the input of the Transformer decoder, and their locations serve as the initial guesses of the bounding boxes of the region proposals.
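The query-selection step can be sketched as below: pixel features from the encoded maps are ranked by predicted objectness and the top-scoring ones initialize the decoder queries. The function name and shapes are illustrative assumptions; the query count follows 'num_queries' in Tab. 7.

```python
import torch

def select_topk_queries(encoded_pixels, objectness_logits, k=900):
    """Sketch of query selection: pick the top-k encoded pixel features by
    predicted objectness to initialize the decoder's object queries.
    encoded_pixels: (num_pixels, d)   objectness_logits: (num_pixels, 1)
    """
    scores = objectness_logits.squeeze(-1)          # (num_pixels,)
    topk = scores.topk(k).indices
    return encoded_pixels[topk], topk               # (k, d) selected features and indices

feats, idx = select_topk_queries(torch.randn(20000, 256), torch.randn(20000, 1), k=900)
```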

The Transformer decoder generates a set of candidate object proposals $\{(\mathbf{f}^{\mathrm{sem}}_i, \mathbf{b}_i, \mathbf{m}_i)\}_{i=1}^{N}$, where $\mathbf{f}^{\mathrm{sem}}_i$, $\mathbf{b}_i$, and $\mathbf{m}_i$ are the semantic, bounding box, and segmentation mask representations of the $i$-th proposal, respectively. Following Mask2Former [mask2former] and MaskDINO [li2022mask], the segmentation mask representation is obtained by the dot product between the final-layer hidden state $\mathbf{q}_i$ of the $i$-th proposal and a per-pixel feature map,

$$ \mathbf{m}_i = \mathbf{q}_i \otimes \mathcal{M}\big(\mathcal{T}(C_1) + \mathcal{F}(E)\big), \tag{8} $$

where $C_1$ is the first-scale backbone feature map, $E$ is the highest-resolution encoded feature map from the Transformer encoder, $\mathcal{T}$ is a convolution layer followed by a Group Normalization (GN) [wu2018group], $\mathcal{F}$ is a convolution followed by a GN and a bilinear upsampling, and $\mathcal{M}$ is a convolution followed by a GN, a ReLU, and a convolution.
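A minimal sketch of the dot-product mask prediction in Eq. (8), with the per-pixel feature map assumed to be already computed:

```python
import torch

def predict_masks(query_embed, pixel_map):
    """Sketch of Eq. (8): segmentation masks as a dot product between each
    proposal's final-layer query embedding and a per-pixel feature map.
    query_embed: (N, d)   pixel_map: (d, H, W)  ->  mask logits (N, H, W)."""
    return torch.einsum("nd,dhw->nhw", query_embed, pixel_map)

masks = predict_masks(torch.randn(200, 256), torch.randn(256, 160, 160))
```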

The regional representations are obtained by fusing the semantic, bounding box, and segmentation mask representations,

$$ \mathbf{f}^{\mathrm{region}}_i = \mathbf{f}^{\mathrm{sem}}_i + \mathrm{Proj}\big(\mathrm{PE}(\mathbf{b}_i)\big) + \mathrm{Proj}\big(\mathrm{Pool}(\mathbf{m}_i)\big), \tag{9} $$

where $\mathrm{PE}(\cdot)$ denotes the positional encoding of box coordinates, $\mathrm{Pool}(\cdot)$ uses adaptive average pooling to scale the mask predictions to a fixed spatial size, and both terms are followed by linear projections to match the feature dimension. Note that the bounding box and segmentation mask representations are detached before fusing.

To reduce the computational cost, we predict an objectness score for each proposal and select the top-$K$ proposals as the final regional representations. $K$ is set to 200 by default in Uni-Perceiver v2.

Figure 3: Architecture overview of our Uni-Perceiver v2.

Loss Function.  In non-localization tasks such as image classification, the supervision is applied only to the final predictions of the unified decoder as in Eq. (7), and there is no extra supervision for the proposal generation of the image encoder. In localization tasks such as object detection, additional supervision is applied to train the region proposal network. Specifically, we adopt the contrastive query denoising of MaskDINO [li2022mask] for training the Transformer decoder. For better convergence of the region proposal network, we predict objectness, bounding boxes, and segmentation masks at the outputs of the Transformer encoder and of each Transformer decoder layer, and detection losses with binary classification (i.e., predicting objectness instead of classes) are applied to each output as intermediate supervision.

Appendix B Implementation Details

Region Proposal Network.  The hyper-parameters used in our region proposal network are listed in Tab. 7. These values mainly follow Mask DINO [li2022mask], with small modifications. The number of candidate object proposals used to generate regional representations (‘num_queries’ in Tab. 7) is 300 and 900 for the ResNet-50 backbone and the Swin backbones, respectively. To reduce the computation cost of the unified decoder, the region proposals are filtered by their objectness scores, and only the top-scoring region representations are passed to the unified decoder (‘topk_queries’ in Tab. 7). Moreover, to save computation, the point loss used in Mask2Former [mask2former] is adopted to calculate the mask loss with a fixed number of sampled points.

Unified Decoder.  For the Transformer-based unified decoder, a uniform stochastic-depth drop rate of 0.1 is used across all layers. Unlike the Uni-Perceiver series [zhu2022uni_p, zhu2022uni], the layer-scale technique [touvron2021cait] is not enabled, since no instability is observed when training the 6-layer unified decoder. In addition, when Conditional MoE is employed in the unified decoder, the number of experts in each layer is set to 8.

Data Augmentation.  For all tasks except detection and segmentation, we apply data augmentation techniques similar to Uni-Perceiver [zhu2022uni_p], with different input image resolutions for the Swin backbones and the ResNet-50 backbone. For object detection and instance segmentation, we first randomly resize the input image with its shorter side between 200 and 1800 pixels and its longer side at most 2400 pixels, and then crop the image to a fixed size during training. For evaluation, the shorter side is set to 1400 and the maximum longer side to 1600.

Others.  Tab. 8 lists the batch size, sampling weight $s_k$, and scaling factor $w_k$ for each task and dataset in the joint training.

Item Value
enc_layers 6
dec_layers 6
dim_feedforward   2048
hidden_dim 256
dropout
nheads 8
num_queries 300/900
topk_queries 200
enc_n_points 4
dec_n_points 4
cls_cost_coef
bbox_cost_coef
giou_cost_coef
mask_cost_coef
dice_cost_coef
cls_loss_coef
bbox_loss_coef
giou_loss_coef
mask_loss_coef
dice_loss_coef
dn_box_noise_scale
dn_label_noise_ratio
Table 7: Hyper-parameters used in our region proposal network.
task dataset #data batch size / GPU sampling weight scaling factor
Image Classification ImageNet-1k [deng2009imagenet] 1.28M 28 0.1 1.0
Object Detection & Instance Segmentation COCO [lin2014microsoft] 118K 1 0.25 1.0
Masked Language Modeling Books&Wiki [zhu2015aligning] - 256 0.05 0.5
Image Captioning YFCC [yfcc] 14.8M 24 0.09831 0.16385
CC12M [changpinyo2021cc12m] 11.1M 24 0.08514 0.1419
CC3M [sharma2018conceptual] 3M 24 0.04428 0.0738
Visual Genome [krishna2017visual] 108K 24 0.02973 0.04955
COCO Caption [Chen2015MicrosoftCC] 113K 24 0.0192 0.032
SBU [ordonez2011im2text] 830K 24 0.02328 0.0388
sum 29.9M - 0.3 0.5
Image-Text Retrieval YFCC [yfcc] 14.8M 28 0.09831 0.3277
CC12M [changpinyo2021cc12m] 11.1M 28 0.08514 0.2838
CC3M [sharma2018conceptual] 3M 28 0.04428 0.1476
Visual Genome [krishna2017visual] 108K 28 0.02973 0.0991
COCO Caption [Chen2015MicrosoftCC] 113K 28 0.0192 0.064
SBU [ordonez2011im2text] 830K 28 0.02328 0.0776
sum 29.9M - 0.3 1.0
Table 8: Tasks and datasets used for our joint training. “#data” is the number of visual training samples. For the image captioning and image-text retrieval tasks, a combination of image-text-pair datasets is used for training, containing about 29.9M visual samples after filtering data that overlaps with the validation sets. To alleviate the data imbalance in this combination during multi-task training, the sampling weight of each dataset is set to be proportional to the square root of its size, which has been demonstrated to be effective [zhu2022uni].
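For illustration, the square-root heuristic described in the caption can be computed as below; this is a minimal sketch and is not guaranteed to reproduce every entry of Tab. 8 exactly (dataset sizes are the approximate visual-sample counts from the table).

```python
import math

def sqrt_sampling_weights(dataset_sizes, task_total):
    """Per-dataset sampling weights proportional to sqrt(dataset size),
    rescaled so they sum to the task-level weight (e.g., 0.3 in Tab. 8)."""
    roots = {name: math.sqrt(n) for name, n in dataset_sizes.items()}
    z = sum(roots.values())
    return {name: task_total * r / z for name, r in roots.items()}

# Image-captioning datasets from Tab. 8 (approximate visual-sample counts).
weights = sqrt_sampling_weights(
    {"YFCC": 14.8e6, "CC12M": 11.1e6, "CC3M": 3e6,
     "Visual Genome": 108e3, "COCO Caption": 113e3, "SBU": 830e3},
    task_total=0.3)
```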

Appendix C Detection on Novel Categories

Thanks to the general task modeling of Uni-Perceiver v2, different tasks can borrow knowledge from each other. For example, the object detection task can generalize to novel categories from the image classification dataset. Fig. 4 shows detection results of Uni-Perceiver v2 on images from the ImageNet-1k validation set whose categories do not exist in the COCO dataset. This demonstrates the generalization ability of Uni-Perceiver v2 and indicates the benefit of general task modeling.

Figure 4: Detection results on novel categories. We show detection results on images from the ImageNet-1k validation set. Note that Uni-Perceiver v2 uses only the COCO dataset for training the object detection task, and most classes in ImageNet-1k are not seen during this training.

Appendix D Licenses of Datasets

ImageNet-1k [deng2009imagenet] is subject to the ImageNet terms of use [imagenetterms].

COCO [lin2014microsoft]: the images are subject to the Flickr terms of use [flickr2020terms].

BookCorpus [zhu2015aligning]: Replicate Toronto BookCorpus is open-source and licensed under GNU GPL, Version 3.

Wikipedia: most of Wikipedia’s text is co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA) and the GNU Free Documentation License (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts). Some text has been imported only under CC BY-SA and CC BY-SA-compatible licenses and cannot be reused under the GFDL.

YFCC [yfcc]: all photos and videos provided in the YFCC dataset are licensed under one of the Creative Commons copyright licenses.

CC12M [changpinyo2021cc12m] is licensed under the Terms of Use of Conceptual 12M [cc12mlicense].

CC3M [sharma2018conceptual] is licensed under the Conceptual Captions Terms of Use [cc3mlicense].

Visual Genome [krishna2017visual] is licensed under a Creative Commons Attribution 4.0 International License [vgterms].

COCO Captions [Chen2015MicrosoftCC]: the images are subject to the Flickr terms of use [flickr2020terms].

SBU Captions [ordonez2011im2text]: the images are subject to the Flickr terms of use [flickr2020terms].
