UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes

by Alexander Kolesnikov, et al.

We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks. In contrast to previous models, UViM has the same functional form for all tasks; it requires no task-specific modifications which require extensive human expertise. The approach involves two components: (I) a base model (feed-forward) which is trained to directly predict raw vision outputs, guided by a learned discrete code and (II) a language model (autoregressive) that is trained to generate the guiding code. These components complement each other: the language model is well-suited to modeling structured interdependent data, while the base model is efficient at dealing with high-dimensional outputs. We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks: panoptic segmentation, depth prediction and image colorization, where we achieve competitive and near state-of-the-art results. Our experimental results suggest that UViM is a promising candidate for a unified modeling approach in computer vision.





1 Introduction

Many computer vision tasks require producing high-dimensional structured outputs. Examples include various types of image segmentation, monocular depth estimation, surface normal estimation, colorization, object detection, image super-resolution, etc. By handcrafting architectures and training procedures specific to each task, the structure of those target outputs can be exploited to learn better models. However, this fragmented approach impedes the ability to build a general solution ready to be applied to any task.

For the tasks above, which require predicting high-dimensional structured outputs, the direct use of powerful parametric models such as CNNs krizhevsky2012alexnet ; simonyan2015vgg ; he2016resnet and Vision Transformers dosovitskiy2021vit , trained with a decomposable (e.g. pixel-wise) loss, is not sufficient, as this basic approach lacks the ability to model the structure of the output. To address this shortcoming, standard approaches turn to additional modeling components such as anchor boxes ren2015fasterrcnn ; lin2017focal , non-maximal suppression ren2015fasterrcnn ; lin2017focal , matching losses carion2020detr ; cheng2021mask2former ; cheng2021maskformer or conditional random fields chen2017deeplab ; yuan2022new .

Recently, there have been significant advances in the modeling of complex structured outputs in the context of language generation and (conditional) image generation: autoregressive models van2016conditional ; salimans2017pixelcnn++ ; kolesnikov2017pixelcnn , GANs goodfellow2014generative , VAEs kingma2013auto , VQ-VAE van2017vqvae , and diffusion models sohl2015deep ; ho2020denoising . However, using such techniques to tackle discriminative problems in a unified way remains under-explored.

In this work, we propose a new approach, UViM, capable of modeling many vision tasks, leveraging recent advances in discrete representation learning van2017vqvae and language modeling vaswani2017attention . We show competitive results in three diverse tasks: panoptic segmentation kirillov2019panoptic , depth prediction silberman2012indoor and colorization zhang2016colorful . Crucially, no task-specific components are required for any task. All of the tasks use the same model and are amenable to transfer learning from standard pre-trained models.

2 Unified Modeling Approach for Vision

(a) Stage I training: we train the base model, which is guided by the code produced by the restricted oracle model. The oracle has access to the ground-truth label, but is only allowed to communicate with the base model by passing a short discrete sequence, which we call a guiding code.
(b) Stage II training: we train a language model (LM) to output a guiding code by learning to mimic the oracle, but using only the image input.
Figure 1: An overview of the UViM learning procedure. Blue blocks depict parts of the model which are being optimized, while black blocks depict frozen components.

We first discuss the motivation and inspiration behind our unified modeling approach: UViM. Then we give a high-level overview, followed by an in-depth explanation of its design and technical details.

2.1 Motivation

The field of computer vision made a huge leap by transitioning to models based on rich parametric functions (CNNs krizhevsky2012alexnet ; simonyan2015vgg ; szegedy2015inception ; he2016resnet , ViTs dosovitskiy2021vit ). Combined with well-working gradient-based algorithms to train these functions (e.g. Adam kingma2014adam ), this enables the learning of complex feedforward mappings from inputs to outputs, which we call base models.

Despite the ability to train powerful and reusable base models, different vision applications, particularly those involving high-dimensional structured outputs, such as object bounding boxes, per-pixel segmentation masks or 3D point clouds, still rely on highly customized components and techniques listed in the introduction.

The necessity for introducing these non-trivial custom components has the same underlying root cause: the outputs are high-dimensional and structured (interdependent) and modeling complex interactions is a necessary condition for succeeding at a given task. Modeling such data is a long-standing challenge in computer vision (and beyond), with numerous books on the subject nowozin2011structured ; koller2009probabilistic and remains a relevant area of research.

In contrast to computer vision, a prominent unified modeling approach has been adopted in NLP. Many NLP tasks can be handled by an autoregressive sequence model raffel2020t5 parameterized by the Transformer architecture vaswani2017attention . This approach combines multiple desirable properties: it is theoretically sound, expressive (capable of modeling a joint probability distribution of the outputs) and there are robust techniques for training such models.

Why has the computer vision field not yet adopted a similar unified model? There are papers that demonstrate that the NLP-like approach, based on autoregressive sequence models, is viable for some image tasks chen2022pixseq ; royer2017probabilistic . However, it only works for tasks that have compact output representations; the additional challenge in vision is that outputs are often very high-dimensional. NLP models typically model sequences of length up to 10’000 tokens, while in vision, outputs, such as per-pixel image segmentation/instance masks, may contain millions of elements. It is computationally prohibitive to apply autoregressive sequence models directly to such tasks.

2.2 Unified vision model via learned guiding code

Figure 2: The schematic illustration of UViM during inference.

Now we present our unified modeling approach for computer vision. We first give a high-level overview of the proposed model and then describe its components in detail.

We devise a unified vision model as a composition of a standard feedforward base model and an autoregressive language model of a short sequence. Our decomposition works well even for vision tasks that deal with extremely high-dimensional and structured outputs.

Our key insight is to reduce the original task of modeling very high-dimensional structured output (e.g. panoptic segmentation mask) to modeling a short discrete sequence with the language model. For this, we propose an optimization procedure, illustrated in Figure 1. The resulting model during inference is depicted in Figure 2.

Stage I training: learning with a guiding code. To build a unified vision model we start from a base model, which directly maps task inputs to task outputs. As discussed above, learning such a model with a simple element-wise loss for a structured label space does not result in a good prediction model, as it does not model complex interactions within the output space.

To compensate for this modeling deficiency, we introduce an additional input, called the guiding code. The assumption is that, given the task input and the guiding code, the elements of the output have fewer dependencies and can be modelled well by the base model. As an illustrative example, consider colorization: given a grayscale image of a car, the pixel colors are highly dependent (most cars are of uniform color). However, given a guiding code with the information “the car is red”, such cross-pixel dependencies cease to exist.

The guiding code has two key properties. First, it is represented as a short discrete sequence of fixed length. Second, it is derived from the output by a special function, which we call the restricted oracle because it has access to the target (ground truth), but at the same time is forced to compactly represent the information that will help to solve the task. Note that the restricted oracle is only used during training, not at test time.

We train the base model and the restricted oracle jointly and end-to-end by minimizing a reconstruction loss between the base model's output and the ground-truth label. For the reconstruction loss, we use the simplest task-appropriate loss function, e.g. pixel-wise cross-entropy or mean squared error. The stage I training step is illustrated in Figure 1(a).


Empirically, we observe that the base model, “aided” by the guiding code from the restricted oracle, is capable of solving complex vision tasks very well, as measured by the task-specific standard metrics. Note that this combination is not a prediction model, as the guiding code depends on the ground truth. Nevertheless, in this stage we have introduced a crucial component, which helps to reduce a high-dimensional structured prediction task to modeling a short sequence of discrete variables.

Stage II training: learning to model the guiding code. In the second stage, we model the discrete guiding code using the image input. The training data is a collection of input-output pairs, each consisting of an input and the guiding code that the fixed restricted oracle trained in stage I produces for the corresponding label. Note that this task is equivalent to many standard NLP problems (except that the input is an image) and there is a vast body of research and tools to tackle it. We use a standard encoder-decoder language model vaswani2017attention , which processes the image through the encoder and passes it to the autoregressive decoder. Training is performed end-to-end with gradient-based optimization. See Figure 1(b) for an illustration of the stage II learning step.

Resulting unified vision model. As a result of the two-stage optimization procedure, we obtain a final model, in which the base model is guided by the code that the language model generates from the image. We call it UViM, short for Unified Vision Model. See Figure 2 for an overview. Later, in the experimental section, we show that such a model can be successfully trained to model highly structured outputs for very different vision tasks.
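The composition used at inference time can be sketched as below. This is only a toy illustration under stated assumptions, not the authors' implementation: `uvim_predict`, `toy_language_model` and `toy_base_model` are hypothetical names, and the two "models" are trivial stand-ins that merely show how data flows from the image to the guiding code and then to the output.

```python
from typing import Callable, List

Image = List[int]   # toy stand-in for an image (hypothetical type alias)
Code = List[int]    # short discrete guiding code


def uvim_predict(image: Image,
                 language_model: Callable[[Image], Code],
                 base_model: Callable[[Image, Code], List[int]]) -> List[int]:
    """Inference: the language model proposes a code, the base model decodes it."""
    code = language_model(image)      # stage II model: image -> guiding code
    return base_model(image, code)    # stage I model: (image, code) -> output


# Toy stand-ins, purely illustrative (not real networks).
def toy_language_model(image: Image) -> Code:
    return [int(v > 5) for v in image]


def toy_base_model(image: Image, code: Code) -> List[int]:
    return [v + c for v, c in zip(image, code)]


prediction = uvim_predict([2, 9], toy_language_model, toy_base_model)
```

Note that the restricted oracle does not appear anywhere in this function, which matches the statement that it is only needed during training.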

2.3 Implementation details

Joint training of the base model and the restricted oracle. Stage I training involves training a model that contains a discrete bottleneck, the guiding code, which is used to guide the base model. Such a discrete bottleneck is problematic for training with gradient-based methods, as it does not have a gradient. To address this, we employ the technique introduced by the seminal VQ-VAE paper van2017vqvae . The key idea is to map each embedding to be quantized to the nearest entry in a dictionary of learned embeddings. We refer the reader to the paper for a detailed overview.
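The nearest-entry lookup at the heart of this quantization step can be sketched as follows. This is a minimal pure-Python illustration, assuming squared Euclidean distance; the function name `quantize` and the toy dictionary are hypothetical, and the gradient handling used during training (the straight-through estimator of VQ-VAE) is not shown.

```python
from typing import List, Sequence, Tuple


def quantize(vectors: Sequence[Sequence[float]],
             dictionary: Sequence[Sequence[float]]
             ) -> Tuple[List[int], List[List[float]]]:
    """Map each vector to the index and value of its nearest dictionary entry."""
    codes, quantized = [], []
    for v in vectors:
        # Squared Euclidean distance from v to every dictionary entry.
        dists = [sum((a - b) ** 2 for a, b in zip(v, e)) for e in dictionary]
        idx = min(range(len(dictionary)), key=dists.__getitem__)
        codes.append(idx)
        quantized.append(list(dictionary[idx]))
    return codes, quantized


# Toy dictionary with three 2-dimensional entries (illustrative values).
dictionary = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
codes, quantized = quantize([[0.9, 1.2], [3.5, -0.1]], dictionary)
```

The list `codes` plays the role of the discrete guiding code, while `quantized` is what would be passed on to the base model.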

Addressing embedding dictionary usage. We observed that during stage I training the usage of the VQ-VAE dictionary may be highly unbalanced, with certain entries going unused. To address this, we adapt the classic Linde-Buzo-Gray linde1980algorithm splitting algorithm to VQ-VAE’s dictionary learning procedure. Specifically, if we detect an unused embedding during training, we take the most frequently used embedding and split it into two new embeddings by applying a tiny amount of noise, with one of the two replacing the unused embedding.
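A sketch of this splitting step is given below. It is an illustration under assumptions rather than the authors' code: the function name `split_unused_entries`, the Gaussian noise, the noise scale, and the choice to halve the donor's usage count after a split are all hypothetical details that the text does not specify.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_unused_entries(dictionary: Sequence[Sequence[float]],
                         usage_counts: Sequence[float],
                         noise_scale: float = 1e-3,
                         rng: Optional[random.Random] = None
                         ) -> Tuple[List[List[float]], List[float]]:
    """Replace each unused entry with a noisy copy of the most used entry."""
    rng = rng or random.Random(0)
    new_dict = [list(e) for e in dictionary]
    counts = list(usage_counts)
    for unused in [i for i, c in enumerate(counts) if c == 0]:
        donor = max(range(len(counts)), key=counts.__getitem__)
        original = new_dict[donor]
        # Split the donor into two slightly perturbed copies.
        new_dict[donor] = [x + rng.gauss(0.0, noise_scale) for x in original]
        new_dict[unused] = [x + rng.gauss(0.0, noise_scale) for x in original]
        # Assumption: both halves inherit half of the donor's usage.
        counts[donor] = counts[unused] = counts[donor] / 2
    return new_dict, counts


old_dict = [[1.0, 2.0], [9.0, 9.0], [5.0, 5.0]]
new_dict, new_counts = split_unused_entries(old_dict, [10, 0, 3])
```

In this toy call, entry 1 is unused, so it is overwritten by a perturbed copy of entry 0 (the most used entry), while entry 2 is left untouched.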

Architectures of the base model, the restricted oracle and the language model. Throughout our experiments, we strive to use as uniform a setup as possible. By default, we use a plain ViT architecture to parameterize all functions. Specifically, the base model and the restricted oracle are modeled by the ViT architecture introduced in dosovitskiy2021vit . For historical reasons we equip the restricted oracle with the image as an additional input, though it appears not to affect the resulting model. The language model is a standard encoder-decoder model and consists of two parts: an encoder and a decoder. The encoder is also modeled by the ViT backbone. The decoder is modeled by the standard transformer decoder, which is identical to the ViT model without the initial projection for image patches.

Controlling sequence length of guiding code. As the restricted oracle is parameterized by the ViT model, its output is a collection of vectors arranged as a grid. To disentangle the number of vectors from the guiding code size, we optionally perform a linear spatial resize operation.

Dropout for guiding code. Empirically, we find that modeling the code during stage II can be quite challenging. This motivates us to explore a code dropout mechanism to affect the code complexity. For each training example in a batch, we randomly select an integer between 0 and the code length. Then, we mask a random subset of that many codewords before inputting the code to the base model. As a result, the base model learns not to rely too heavily on any individual codeword and the code becomes more robust. Intuitively, we expect that this helps to obtain a better final stage II UViM model. We empirically validate the effect of this approach in Section 4.
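The code dropout procedure can be sketched for a single training example as follows. This is a minimal illustration, assuming that dropped codewords are overwritten with a fixed placeholder; the name `code_dropout` and the placeholder value `MASK_VALUE = 0` are assumptions made for this sketch.

```python
import random
from typing import List, Sequence

MASK_VALUE = 0  # assumption: placeholder written into dropped positions


def code_dropout(code: Sequence[int], rng: random.Random,
                 mask_value: int = MASK_VALUE) -> List[int]:
    """Mask a random number (0..len(code)) of randomly chosen codewords."""
    n = len(code)
    k = rng.randint(0, n)                 # how many codewords to drop (inclusive)
    dropped = rng.sample(range(n), k)     # which positions to drop
    out = list(code)
    for position in dropped:
        out[position] = mask_value
    return out


rng = random.Random(0)
original_code = [5, 7, 9, 11]
noisy_code = code_dropout(original_code, rng)
```

Because the number of dropped codewords is itself random, the base model sees everything from the full code to a completely masked code during training.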

Sampling from the language model at test time. Running UViM at test time involves evaluating two functions: the language model and then the base model. While evaluating the base model is straightforward, the language model is autoregressive and models a joint distribution over the guiding code given the image. Sampling from such a model is a known and extensively studied task in the NLP literature cho2014properties ; sutskever2014sequence ; vaswani2017attention . In our initial experiments we observed that the simplest sampling approach seems to work well and that more complex sampling techniques, such as beam search, are not necessary. Thus, we sample using the most standard coordinate-wise sequential sampling, drawing each code element in turn conditioned on the image and the previously sampled elements. Note that we can optionally vary the temperature of the conditional distributions. By setting the temperature to zero we can produce the “most likely” sample, but lose diversity. In contrast, with the default temperature of one, we can get diverse samples (and consequently diverse predictions), but potentially at the expense of prediction quality.
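Coordinate-wise sequential sampling with a temperature can be sketched as below. This is a generic illustration rather than the UViM implementation: `sample_token`, `sample_code` and `toy_logits` are hypothetical names, and `toy_logits` stands in for a real autoregressive decoder conditioned on an image.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def sample_token(logits: Sequence[float], temperature: float,
                 rng: random.Random) -> int:
    """Sample one token; temperature 0 is treated as a greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]  # numerically stable softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_code(next_token_logits: Callable[[List[int]], Sequence[float]],
                length: int, temperature: float = 1.0,
                rng: Optional[random.Random] = None) -> List[int]:
    """Sample a code one element at a time, conditioning on the prefix."""
    rng = rng or random.Random(0)
    code: List[int] = []
    for _ in range(length):
        code.append(sample_token(next_token_logits(code), temperature, rng))
    return code


def toy_logits(prefix: List[int]) -> List[float]:
    # Hypothetical decoder: favours the token equal to len(prefix) mod 3.
    favoured = len(prefix) % 3
    return [2.0 if t == favoured else 0.0 for t in range(3)]


greedy = sample_code(toy_logits, length=4, temperature=0)
diverse = sample_code(toy_logits, length=4, temperature=1.0,
                      rng=random.Random(1))
```

Taking the argmax at every step does not guarantee the globally most probable sequence, which is consistent with the quotation marks around “most likely” in the text.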

3 Experiments

We apply UViM to three diverse tasks: a general scene understanding panoptic segmentation task, a conditional generative image colorization task and a 3D scene understanding task of depth prediction. With UViM, we use a unified setup for all three seemingly different tasks. Quantitative results are presented in Table 1 and qualitative results are in Appendix A. We describe our main modeling choices below; we also provide full configuration files (as-is) in Appendix B. The full UViM code will be made publicly available in the big_vision codebase (https://github.com/google-research/big_vision).

Experimental setup for stage I. We parameterize the base model and the restricted oracle with the ViT-B/16 model. For the restricted oracle we use 6 layers instead of 12, as in the initial experiments we observed that a relatively small capacity is sufficient. Both models are trained from scratch.

The input and output resolution during stage I for all tasks is . For optimization we use a variant of Adafactor shazeer2018adafactor introduced in zhai2021scaling . Due to differences in dataset size, we tune the learning rate and number of epochs per task, but all other hyperparameters are the same.

For the guiding code produced by the restricted oracle, we use a sequence length of 256 with 4096 dictionary entries. To put this choice into perspective, for the panoptic task, the original panoptic mask is encoded as roughly discrete values, each ranging approximately from 0 to 100. Thus, the guiding code is more than three orders of magnitude more compact than the original label.
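As a small worked calculation, the amount of information that such a code can carry follows directly from the two stated numbers (sequence length 256, dictionary size 4096). The variable names below are illustrative only.

```python
import math

code_length = 256          # number of codewords in the guiding code
dictionary_size = 4096     # number of possible values per codeword

bits_per_codeword = math.log2(dictionary_size)   # 12 bits per codeword
total_bits = code_length * bits_per_codeword     # capacity of the whole code
total_bytes = total_bits / 8
```

Under these settings the guiding code can carry at most 3072 bits (384 bytes) per example, which gives a sense of how strongly the oracle is restricted.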

Experimental setup for stage II. The language model consists of the encoder and the autoregressive decoder. For the encoder, by default, we use the ViT-L/16 model. We initialize the encoder with the ImageNet-21k russakovsky2015imagenet pre-trained model from steiner2021train . For the decoder, we use the ViT-B model. Note that the decoder has no initial patch projection, as it uses the guiding code as autoregressive input; this makes it equivalent to the standard BERT-Base devlin2018bert architecture.

As in the stage I, the input and output resolution for all tasks is , except for the panoptic task, where we use a higher input resolution of . For optimization, we use the same optimizer as in Stage I. For all tasks, we use a base learning rate of with cosine decay and, additionally, apply a 10-fold reduction for the pre-trained encoder weights. Due to differences in dataset size, the number of epochs is tuned per task.
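The learning-rate rule described above (cosine decay, with a 10-fold reduction for the pre-trained encoder weights) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the schedule is assumed to decay to zero with no warmup, and the base learning rate is left as a parameter because it is a placeholder here, not a value taken from the paper.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 to 0 at total_steps (assumed form)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def param_lr(step: int, total_steps: int, base_lr: float,
             is_pretrained_encoder: bool) -> float:
    """Apply a 10-fold reduction to pre-trained encoder parameters."""
    scale = 0.1 if is_pretrained_encoder else 1.0
    return scale * cosine_lr(step, total_steps, base_lr)


# Hypothetical base learning rate of 1.0, chosen only to make ratios visible.
decoder_lr_mid = param_lr(50, 100, 1.0, is_pretrained_encoder=False)
encoder_lr_mid = param_lr(50, 100, 1.0, is_pretrained_encoder=True)
```

At any step the encoder rate is one tenth of the decoder rate, so the pre-trained weights move more slowly than the randomly initialized ones.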

For all our experiments we use Google Cloud TPU-v3 hardware. A stage I training run for panoptic segmentation requires 1.9k TPU-v3 hours, while a stage II training run requires 0.9k TPU-v3 hours.

Data augmentations. We strive to use simple and standard augmentations for all tasks. At train time we use an inception crop szegedy2015inception and random horizontal flipping, followed by a resize to a square-shaped input. At test time we only resize the inputs to a square of the input resolution.

COCO Panoptic [PQ] | NYU Depth v2 [RMSE] | ImageNet Colorization [FID-5k]
UViM (ours) | UViM (ours) | UViM (ours)
DETR-R101 carion2020detr | DenseDepth alhashim2018densedepth | ColTran kumar2021colorization
Mask2Former cheng2021mask2former | BinsFormer li2022binsformer | Palette saharia2021palette
Table 1: Comparison of the presented modeling approach (UViM) and other related works discussed in Section 5, including the current state of the art. Note that ours is the only work covering a set of significantly different tasks dominated by different types of approaches. Standard deviations are computed across three independent reruns.

Figure 3: We demonstrate how UViM performs across three diverse tasks. Note that when provided with the oracle’s guiding code it achieves near-perfect results (3rd row). Predictions of the final UViM model are exemplified in the 4th row. They are generally of very high quality and confirm that the LM can successfully learn to produce the guiding code from the image input.

3.1 Panoptic segmentation

Panoptic segmentation kirillov2019panoptic is a general scene understanding task, which requires mapping every image pixel to its semantic class and, if applicable, instance ID. We adopt the raw target representation used in the original paper: a 2-channel mask, where the first channel encodes semantics, and the second channel encodes instance IDs. During training we assign instance IDs in a raster-scan order of object centers.
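Assigning instance IDs in raster-scan order of object centers can be sketched as below. The details are assumptions made for illustration: the center is taken to be the mean pixel coordinate, ordering is top-to-bottom and then left-to-right, IDs start at 1, and `assign_instance_ids` is a hypothetical name.

```python
from typing import Dict, Hashable, List, Tuple

Pixel = Tuple[int, int]  # (row, column)


def assign_instance_ids(instances: Dict[Hashable, List[Pixel]]
                        ) -> Dict[Hashable, int]:
    """Order instances by (row, column) of their center and number them."""
    centers = {}
    for name, pixels in instances.items():
        rows = [r for r, _ in pixels]
        cols = [c for _, c in pixels]
        # Assumption: the object center is the mean of its pixel coordinates.
        centers[name] = (sum(rows) / len(rows), sum(cols) / len(cols))
    ordered = sorted(centers, key=lambda name: centers[name])
    return {name: index + 1 for index, name in enumerate(ordered)}


toy_instances = {
    "a": [(5, 5), (5, 6)],   # center (5.0, 5.5)
    "b": [(1, 8)],           # center (1.0, 8.0)
    "c": [(1, 2), (1, 3)],   # center (1.0, 2.5)
}
instance_ids = assign_instance_ids(toy_instances)
```

A deterministic ordering of this kind gives every image a canonical labelling, so that the same scene always maps to the same target mask.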

We train on the COCO panoptic 2017 lin2014coco ; kirillov2019panoptic dataset. It has approximately 118’000 training images and 5’000 official validation images which we use for test. All hyper-parameters were selected on 4’096 images held out from the training data. For evaluation, we use the official metric, called panoptic quality (PQ), which jointly estimates the accuracy of semantic and instance segmentation. We train stage I for 1000 epochs and stage II for 200 epochs.

As the reconstruction loss during stage I, we use the standard categorical cross-entropy loss for each channel independently. At test time, the output mask is first formed by the predicted instance channel. Then each instance is labeled by the majority vote of pixels from the semantic channel. This avoids inconsistencies in which pixels with the same instance ID, but different semantic categories, are interpreted as different instances. We additionally remove tiny objects that occupy less than 0.1% of all pixels. At test time, we resize the outputs to the target resolution via nearest neighbour.
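The test-time merging of the two channels can be sketched on flattened masks as follows. This is an illustration under assumptions, not the authors' post-processing code: `merge_channels` is a hypothetical name, removed objects are assumed to be overwritten with a void label of 0, and ties in the vote are broken arbitrarily.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def merge_channels(semantic: Sequence[int], instance: Sequence[int],
                   min_fraction: float = 0.001, void: int = 0
                   ) -> Tuple[List[int], List[int]]:
    """Label each instance by majority vote and drop tiny instances."""
    total = len(instance)
    pixels_by_instance: Dict[int, List[int]] = {}
    for index, inst in enumerate(instance):
        pixels_by_instance.setdefault(inst, []).append(index)

    out_semantic = [void] * total
    out_instance = [void] * total
    for inst, indices in pixels_by_instance.items():
        if len(indices) < min_fraction * total:
            continue  # tiny object: leave its pixels as void (assumption)
        votes = Counter(semantic[i] for i in indices)
        label = votes.most_common(1)[0][0]
        for i in indices:
            out_semantic[i] = label
            out_instance[i] = inst
    return out_semantic, out_instance


# Toy 10-pixel mask; a large threshold is used so that removal is visible.
sem, ins = merge_channels(
    semantic=[3, 3, 4, 7, 7, 7, 7, 7, 7, 7],
    instance=[1, 1, 1, 2, 2, 2, 2, 2, 2, 9],
    min_fraction=0.2,
)
```

In the toy call, the stray semantic label 4 inside instance 1 is overruled by the majority label 3, and the single-pixel instance 9 is removed.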

Table 1 reports the PQ achieved by UViM, which outperforms a recent strong baseline model, DETR-R101 carion2020detr . We focus on evaluating the generality of our approach, hence we avoid specialization towards individual tasks, such as commonly-used feature pyramids or scale jitter augmentations. As a result, we lag behind the most recent state of the art cheng2021mask2former . We expect that the gap can be bridged by further refining UViM with a better understanding of its components and smarter modeling choices.

Figure 4: UViM outputs for the colorization task. Different samples are produced by re-sampling the guiding code from the language model. Visually, the resulting samples are consistent and diverse.

3.2 Colorization

Colorization requires mapping grayscale pixels of an image to plausible colors. For a given input there are many possible outputs, so Fréchet Inception Distance (FID) heusel2017fid is a common metric. We opt to model this as a mapping from grayscale to RGB and use mean squared error as the reconstruction loss during stage I training. For training we use ImageNet russakovsky2015imagenet training split consisting of M examples. We follow ColTran kumar2021colorization and report FIDs using the prescribed splits of images for metric computation and resize our model predictions to . We train stage I for 100 epochs and stage II for 50 epochs.

Figure 4 demonstrates that UViM is capable of producing high-quality and diverse colorizations for natural images. Table 1 reports the FID it achieves on this task. This is slightly worse than the current state of the art, Palette saharia2021palette , which uses diffusion models to cover a variety of tasks that output natural images, but significantly better than ColTran kumar2021colorization , which uses a conditional autoregressive transformer to output a low-resolution colorization followed by an upsampling model.

3.3 Monocular depth estimation

Depth prediction is a 3D scene understanding task, which requires mapping every pixel to a depth value (distance to the camera). We quantize the depth into buckets using 256 uniformly spaced bins, and use softmax cross entropy as reconstruction loss during stage I.
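Uniform quantization of depth into 256 bins can be sketched as below. The depth range is an assumption made for illustration (0 to 10 metres is used as a hypothetical default, since the range is not stated here), as are the function names and the choice to decode a bin to its center.

```python
def depth_to_bin(depth: float, min_depth: float = 0.0,
                 max_depth: float = 10.0, num_bins: int = 256) -> int:
    """Map a depth value to the index of its uniformly spaced bin."""
    clipped = min(max(depth, min_depth), max_depth)
    width = (max_depth - min_depth) / num_bins
    # The upper boundary falls into the last bin.
    return min(int((clipped - min_depth) / width), num_bins - 1)


def bin_to_depth(index: int, min_depth: float = 0.0,
                 max_depth: float = 10.0, num_bins: int = 256) -> float:
    """Map a bin index back to the depth at the center of that bin."""
    width = (max_depth - min_depth) / num_bins
    return min_depth + (index + 0.5) * width


bin_index = depth_to_bin(3.3)
reconstructed = bin_to_depth(bin_index)
```

With this scheme each pixel becomes a 256-way classification target, and the round-trip error is at most half a bin width.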

We train on the NYU Depth V2 silberman2012indoor dataset consisting of 47’584 training examples captured across 280 indoor scenes, and 654 official validation examples. For hyper-parameter selection we hold out all examples from 14 scenes from the training set. For evaluation, we report the common evaluation metric: root mean squared error (RMSE) on the standard crop of the evaluation images from eigen2014depth . At test time, we resize UViM outputs to the crop resolution via nearest neighbour. We train stage I for 200 epochs and stage II for 50 epochs.
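For reference, the RMSE metric itself can be written as the short sketch below. It operates on flat lists of predicted and ground-truth depths; the function name is illustrative, and the standard evaluation crop and any masking of invalid pixels are not shown.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
off = rmse([0.0, 0.0], [3.0, 4.0])   # sqrt((9 + 16) / 2)
```

Lower values are better, which is worth keeping in mind when reading comparisons between depth models.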

Table 1 reports the RMSE that UViM achieves on this task. To contextualize this result, the score is comparable to DenseDepth alhashim2018densedepth , which uses an architecture composed of a pre-trained DenseNet-169 followed by upsampling layers and skip-connections. Our result still lags behind the most recent state-of-the-art model for this task li2022binsformer , which uses a mixed classification/regression loss, adaptive bins, an auxiliary scene classification task and multi-scale prediction refining. However, UViM has very little task-specific tuning; our depth model is almost identical to our setup for panoptic segmentation, even sharing most hyperparameter values.

Model | Default | From Scratch | no Dropout | no Oracle | no Autoreg. | no Image
UViM (stage I) | 75.7 | 75.7 | 85.8 | 19.6 | 75.7 | 66.1
UViM (stage II) | 43.7 | 39.8 | 42.2 | N/A | 33.3 | 39.1
Table 2: Effect of ablating various UViM components on the panoptic segmentation task. The PQ metric is computed on 4096 examples held out from the training set. Besides stage II results (second row, black), we also show the results of stage I using the restricted oracle’s guiding code (first row, gray).

Figure 5: Outputs of various models in our ablation. We demonstrate that the base model alone is not capable of modeling structured outputs, but when supported with the oracle’s compact guiding code, it achieves near-perfect results. For completeness, we also present the result of the final UViM model.

4 Ablations

In this section we dive deep into understanding UViM and perform ablations of its key components. We run extensive ablations (summarized in Table 2) on the panoptic segmentation task. For completeness, together with the performance of the final UViM models, we demonstrate the performance of UViM models after stage I training, which use the code from the restricted oracle. Some of our ablations are also illustrated visually in Figure 5.

For the default setting we follow the main experimental setup, but use inputs for stage II training. To avoid overfitting to test data, all our ablations are performed using our custom splits, where we hold out 4096 randomly selected images from the training data and use those for evaluation.

Ablating pre-trained weights. UViM is designed to make transfer learning easy, as it uses plain parametric models (without any modifications) that are commonly used for large-scale pre-training steiner2021train . Nevertheless, we ablate the usage of pre-trained weights to understand how UViM performs in this scenario. The “From Scratch” column in Table 2 shows the results (note that, as we only use pre-trained weights for the LM, stage I results are not affected). Notably, we use a longer training schedule with 500 epochs, which improves from-scratch results due to slower convergence. We observe that the from-scratch trained model performs well and achieves competitive results. Nevertheless, from-scratch training is 3.9 PQ points behind the default setup of using pre-trained weights (39.8 vs. 43.7 in Table 2).

Ablating code dropout. The idea of the code dropout procedure, described in Section 2.3, is to make the guiding code learned during stage I less “strict”, so that it is easier to model with the LM in stage II. Table 2 shows the results of ablating this procedure in the “no Dropout” column. As expected, ablating dropout results in better stage I results (by approximately 10 PQ points; 85.8 vs. 75.7 in Table 2), as the oracle’s code is not weakened by the dropout. On the other hand, the final UViM model becomes worse, as the resulting code learned without dropout has a much more complex structure. We support our intuition by comparing the final training losses of the LMs trained on codes with and without dropout. The losses, measured as average negative log-likelihoods, are lower for the code trained with dropout, confirming that such a code is much easier to learn.

For the depth estimation task, we observed no difference when ablating code dropout, indicating that code dropout is not always necessary, but only for tasks where the code can become challenging for the LM to learn.

Ablating restricted oracle model. In this ablation we evaluate the base model trained directly without the restricted oracle. The results in Table 2 confirm our qualitative assessment (see Figure 5) that directly predicting the panoptic mask with a pixel-wise loss works very poorly in the absence of the oracle model.

Ablating autoregressive structure of the LM. So far we have assumed that the guiding code needs to be predicted by a function capable of modeling a joint probability distribution, such as an autoregressive LM. We ablate this design choice and train a non-autoregressive LM, which predicts all components of the guiding code in a single pass, but is otherwise identical to the default models that we use.

The results in Table 2 confirm that the autoregressive component for joint probability distribution modeling is crucial. Ablating this component leads to a significant quality drop of 10.4 PQ points (33.3 vs. 43.7 in Table 2). We observe a similar effect on depth estimation, where RMSE degrades from 0.47 to 0.55.

Ablating image input at stage I training. One interesting ablation is to hide the image input from the base model. In this case our whole UViM model can be interpreted from a different, more limited perspective: the restricted oracle learns to compress a label into the guiding code and, at the same time, the base model learns to decode it back into the original label. Then the LM learns to solve the task in this new learned compact space.

As shown in column “no image” of Table 2, the model solving the task in the learned space still performs reasonably well, though it lags behind the more general default approach. For depth estimation, the base model with no image obtains a similar performance to the full model (within RMSE), indicating that for this task the oracle can compress all of the information required to reconstruct the label into the guiding code.

Varying oracle code length and dictionary size

Figure 6: UViM model performance for the panoptic task (measured as PQ points).

Finally, we investigate how the size of the code affects performance. In particular, we vary the code length and dictionary size (the total number of discrete values for each component). Intuitively, a longer sequence and a larger dictionary make it easier to learn during stage I training, as the oracle “restriction” becomes weaker. However, it is not clear how the code parameters will affect the stage II training and the final UViM model, as longer sequence and more discrete values are potentially harder to learn for the model.

To study this trade-off we train nine models: a cross-product of sequence lengths and dictionary sizes . Figure 6 shows the results. As expected, UViM with oracle stage I model monotonically benefits from longer sequences and bigger dictionary sizes. However, the sweet spot for the final model is the code which is neither too long nor too short.

5 Related work

This paper is related to the vast amount of literature in computer vision, as the proposed modeling approach aims at unifying a wide array of vision tasks. We focus on the most related work that is either pushing in the same direction of model unification or uses highly related modeling techniques.

Generative and autoregressive models. As in generative modeling, our goal is to model high-dimensional structured outputs. A notable work, Pix2Pix isola2017image2image , uses a conditional GAN model to map arbitrary image inputs to arbitrary image outputs. Despite going beyond generative tasks and showing some outputs for the semantic segmentation task, this model has not become a competitive approach, likely due to the complexity and instability of GAN training.

Autoregressive models gained popularity in computer vision as (conditional) image generation tools van2016pixel ; van2016conditional ; salimans2017pixelcnn++ and were later used for tasks like image colorization royer2017probabilistic ; guadarrama2017pixcolor ; kumar2021colorization . However, the scalability of autoregressive models to very high-dimensional outputs is a big problem, which necessitated additional complexity, such as hierarchical generation van2016conditional ; kolesnikov2017pixelcnn or learning an additional upsampling model guadarrama2017pixcolor ; kumar2021colorization . The idea of modeling a complex structured target by recurrent “autoregressive” invocations of a model was used in customized implementations for visual relationship prediction kolesnikov2019detecting and human pose estimation gkioxari2016chained .

Closer to our approach is the use of learned discrete representations with an autoregressively learned prior van2017vqvae. DALL-E ramesh2021dalle showed text-conditioned image generation by using a decoder-only transformer to model a sequence of text and image discrete representations. VQGAN esser2021vqgan shows high-quality natural image generation conditioned on arbitrary image inputs by using an adversarial and a perceptual loss to learn discrete representations. ViT-VQGAN yu2022vectorquantized improved class-conditioned image synthesis with codebook improvements and by parameterizing VQGAN with ViT dosovitskiy2021vit. Similarly, NÜWA wu2021nuwa proposes a 3D transformer encoder-decoder, which covers language, image, and video with learned discrete representations. Notably, these works concentrate on (conditional) generative image tasks and mostly ignore discriminative image tasks.
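To illustrate what a learned discrete representation means in this line of work, the following minimal sketch shows the nearest-codeword lookup at the heart of vector quantization. It is a generic illustration in plain Python, not the implementation of any of the cited models; the tiny codebook and the function names `quantize` and `dequantize` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


def dequantize(indices, codebook):
    """Replace each index by the corresponding codebook vector."""
    return [codebook[k] for k in indices]


# Hypothetical 4-entry codebook of 2-dimensional codewords.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vectors = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

code = quantize(vectors, codebook)           # discrete sequence of indices
reconstruction = dequantize(code, codebook)  # what a decoder would consume
print(code, reconstruction)
```

The sequence of indices is what an autoregressive prior would be trained to model; in practice the codebook is learned jointly with the encoder and decoder rather than fixed by hand.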

Scene understanding. There are several fundamental vision tasks that require a model to perform high-level scene parsing, such as object detection, instance or panoptic segmentation. Many standard methods, such as Faster-RCNN ren2015fasterrcnn, Mask-RCNN he2017mask and RetinaNet lin2017focal, produce “dense” predictions for a large number of scored anchor boxes, followed by an ad-hoc non-maximal suppression procedure to eliminate redundant boxes. DETR carion2020detr takes an alternative approach with an end-to-end model using a set-based global loss (via bipartite matching of proposals and ground truth). The DETR model can also be used for panoptic segmentation kirillov2019panoptic, where initial approaches involved combining models optimized for each sub-part of the task (instance and semantic classification). MaskFormer cheng2021maskformer uses a mask loss to guide the set loss and further claims that the mask classification view of the problem is important for panoptic and also semantic segmentation. Mask2Former cheng2021mask2former shows that a single architecture designed around masks can tackle semantic, instance and panoptic segmentation tasks alike. Despite some promising convergence in the scene understanding area, the proposed approaches remain viable only for an important, but relatively narrow range of tasks.
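For readers unfamiliar with the post-processing step mentioned above, here is a minimal sketch of greedy non-maximal suppression over scored boxes. It is a textbook version written for illustration, not the exact procedure of any cited detector; the `(x1, y1, x2, y2)` box format and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring boxes, dropping heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The hand-set threshold and the greedy ordering are the kind of task-specific design choices that end-to-end approaches aim to avoid.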

Vision model unification. Pix2Seq chen2022pixseq proposes a model highly related to ours. It leverages a plain (sequence) language model for tackling the highly structured task of object detection. However, it is limited to the scenario where the output of a vision task can be manually represented as a short discrete sequence, which is rarely true for vision tasks. In nash2022transframer the authors propose the Transframer model, which uses a language model for modeling image outputs represented as sparse discrete cosine transform codes. However, the paper only shows qualitative results for “discriminative” tasks. Moreover, in comparison to our model, the Transframer is less flexible and powerful because it relies on a pre-defined fixed transform, while UViM learns discrete representations using a powerful end-to-end approach.

6 Conclusion and Discussion

UViM is a modeling approach for vision with the ambitious goal of unifying diverse vision tasks with one technique. Our resulting model consists of two components: an autoregressive language model (for modeling complex structured outputs) and a plain feed-forward base model that helps to handle high-dimensional outputs efficiently. Empirically, we confirm that UViM is capable of tackling diverse vision tasks in a unified way, while achieving competitive results. Our tasks cover semantic scene understanding (panoptic segmentation), conditional generative image modeling (colorization) and 3D scene prediction (depth prediction).

We see UViM as a brave new prototype of a general-purpose learning approach for computer vision. As such, it still has many rough edges that need more research. For example, we do not yet fully understand how to learn the optimal guiding code. Empirically, we observe that the final result is sensitive to the stage I code learning parameters. For example, a code length of 256 seems overall better than 16 and 1024 in our experiments, and adding dropout to the code during its training results in a better final model. We hope future research will come up with a better understanding of how to set up learning of the guiding code, beyond purely empirical observations. Another aspect is computational efficiency. As reported in the paper, the training is relatively compute hungry. More research may be needed to find design choices that will lead to much more efficient training procedures.

We thank Ilya Tolstikhin, who was involved in the initial stages of the project and provided useful feedback on the UViM presentation. We thank Ting Chen for discussions at the early stage of this project. Additionally, we thank Manoj Kumar, who answered our questions on the evaluation for the colorization task. We also thank Daniel Keysers for feedback on the text of this paper.


Appendix A Random sample of UViM predictions

Figure 7: Random UViM example outputs for the panoptic segmentation task. The first column is the ground truth, second is the input image. The remaining columns are model outputs, sampling the guiding code with (i.e. coordinate-wise argmax) in the third column, and with in the remaining ones.
Figure 8: Random UViM example outputs for the colorization task. The first column is the ground truth, second is the grayscale input image. The remaining columns are model outputs, sampling the guiding code with (i.e. coordinate-wise argmax) in the third column, and with in the remaining ones.
Figure 9: Random UViM example outputs for the depth prediction task. The first column is the ground truth, second is the input image. The remaining columns are model outputs, sampling the guiding code with (i.e. coordinate-wise argmax) in the third column, and with in the remaining ones.

Appendix B Configuration files for the panoptic task.

This section lists the full configs, with all hyper-parameters, for training UViM stage I and II. They follow the conventions of the big_vision codebase (https://github.com/google-research/big_vision).

import ml_collections as mlc

RES = 512
# NOTE: PATCH_SIZE is used below but is not defined in this listing; it must
# be set at module level before get_config() is called.


def get_config():
  config = mlc.ConfigDict()

  config.task = 'panoptic_segmentation'

  config.dataset = 'coco/2017_panoptic'
  config.val_split = 'train[:4096]'
  config.train_split = 'train[4096:]'

  config.batch_size = 1024
  config.num_epochs = 1000

  # Training preprocessing: random flip and crop, then resize to RES.
  config.pp_train = (
      f'decode|coco_panoptic|'
      f'concat(["semantics","instances"], "labels")|'
      f'randu("fliplr")|'
      f'det_fliplr(key="image")|det_fliplr(key="labels")|'
      f'inception_box|'
      f'crop_box(key="image")|crop_box(key="labels")|'
      f'resize({RES})|'
      f'resize({RES}, key="labels", method="nearest")|'
      f'value_range(-1, 1)|make_canonical|'
      f'keep("image","labels")'
  )

  # Evaluation preprocessing: deterministic resize only.
  config.pp_eval = (
      f'decode|coco_panoptic|'
      f'concat(["semantics","instances"], "labels")|'
      f'resize({RES})|'
      f'resize({RES}, key="labels", method="nearest")|'
      f'value_range(-1, 1)|make_canonical|'
      f'keep("image","labels")'
  )

  config.shuffle_buffer_size = 25_000

  config.log_training_steps = 50
  config.log_eval_steps = 250
  config.checkpoint_steps = 1000
  config.keep_checkpoint_steps = 20_000

  # Model section
  config.model_name = 'proj.uvim.vit'
  config.model = mlc.ConfigDict()
  config.model.input_size = (RES, RES)
  config.model.patch_size = (PATCH_SIZE, PATCH_SIZE)
  config.model.code_len = 256
  config.model.width = 768
  config.model.oracle_depth = 6
  config.model.base_model_depth = 12
  config.model.mlp_dim = 3072
  config.model.num_heads = 12
  config.model.dict_size = 4096  # Number of words in dict.
  config.model.codeword_dim = 768
  config.model.dict_momentum = 0.995
  config.model.with_encoder_ctx = True
  config.model.with_decoder_ctx = True
  config.model.inputs = {
      # +1 for void label
      'semantics': (133 + 1, PATCH_SIZE**2),
      # COCO: actually 98 train/78 validation.
      'instances': (100, PATCH_SIZE**2),
  }
  config.model.outputs = config.model.inputs

  # Optimizer section
  config.optax_name = 'big_vision.scale_by_adafactor'
  config.optax = dict(beta2_cap=0.95)

  config.lr = 4e-4
  config.wd = 4e-5
  config.schedule = dict(decay_type='cosine', warmup_steps=4_000)
  config.grad_clip_norm = 1.0

  # Evaluation section
  config.evals = [
      ('panoptic_train', 'coco_panoptic'),
      ('panoptic_holdout', 'coco_panoptic'),
      ('panoptic_val', 'coco_panoptic'),
  ]

  base_eval = {
      'pp': config.pp_eval.replace('decode|', ''),
      'log_steps': 10_000,
  }

  config.panoptic_train = mlc.ConfigDict(base_eval)
  config.panoptic_train.prefix = 'coco_panoptic_train/'
  config.panoptic_train.split = 'train[4096:8192]'

  config.panoptic_holdout = mlc.ConfigDict(base_eval)
  config.panoptic_holdout.prefix = 'coco_panoptic_holdout/'
  config.panoptic_holdout.split = 'train[:4096]'

  config.panoptic_val = mlc.ConfigDict(base_eval)
  config.panoptic_val.prefix = 'coco_panoptic/'
  config.panoptic_val.split = 'validation'

  return config
Listing 1: Full config for panoptic stage I training.
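The optimizer settings in Listing 1 specify a cosine schedule with 4,000 warmup steps and a peak learning rate of 4e-4. As a rough illustration of what such a schedule looks like, the sketch below implements linear warmup followed by cosine decay to zero. This shape is an assumption and may differ in detail from the schedule implemented in big_vision; the training-set size of 118,287 images for COCO 2017 is likewise an assumption used only to estimate the total number of steps.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr=4e-4, warmup_steps=4_000):
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Assumption: COCO 2017 train has 118,287 images, of which the first 4,096
# are held out by the 'train[4096:]' split.
NUM_TRAIN_EXAMPLES = 118_287 - 4_096
BATCH_SIZE = 1024
NUM_EPOCHS = 1000
TOTAL_STEPS = NUM_EPOCHS * NUM_TRAIN_EXAMPLES // BATCH_SIZE

for step in (0, 2_000, 4_000, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, warmup_cosine_lr(step, TOTAL_STEPS))
```

Under these assumptions, stage I training runs for roughly 110 thousand steps, so the warmup phase covers only a small fraction of the schedule.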
import ml_collections as mlc

LM_MODELS = {
    'base': dict(num_layers=12, num_heads=12,
                 mlp_dim=3072, emb_dim=768),
    'large': dict(num_layers=24, num_heads=16,
                  mlp_dim=4096, emb_dim=1024),
}
STAGE_I_MODELS = {
    'base': dict(oracle_depth=6, base_model_depth=12,
                 num_heads=12, mlp_dim=3072, width=768),
}
RES = LABEL_RES = 512
# NOTE: PATCH_SIZE and LABEL_PATCH_SIZE are used below but are not defined in
# this listing; they must be set at module level before get_config() is called.


def get_config():
  config = mlc.ConfigDict()
  config.pp_modules = ['ops_general', 'ops_image', 'proj.uvim.pp_ops']

  # "image_ctx" is the copy of the image passed to the stage I model,
  # resized to the label resolution.
  config.pp_train = (
      f'decode|coco_panoptic|'
      f'concat(["semantics","instances"], "labels")|'
      f'randu("fliplr")|'
      f'det_fliplr(key="image")|det_fliplr(key="labels")|'
      f'inception_box|'
      f'crop_box(key="image")|crop_box(key="labels")|'
      f'resize({LABEL_RES}, inkey="image", outkey="image_ctx")|'
      f'resize({RES})|'
      f'resize({LABEL_RES}, key="labels", method="nearest")|'
      f'value_range(-1, 1, key="image_ctx")|'
      f'value_range(-1, 1)|make_canonical|'
      'keep("image", "image_ctx", "labels")'
  )
  config.pp_eval = (
      f'decode|coco_panoptic|'
      f'concat(["semantics","instances"], "labels")|'
      f'resize({LABEL_RES}, inkey="image", outkey="image_ctx")|'
      f'resize({RES})|'
      f'resize({LABEL_RES}, key="labels", method="nearest")|'
      f'value_range(-1, 1, key="image_ctx")|'
      f'value_range(-1, 1)|make_canonical'
      f'|keep("image", "image_ctx", "labels")'
  )
  pp_predict = (
      f'resize({LABEL_RES}, inkey="image", outkey="image_ctx")|'
      f'resize({RES})|'
      f'value_range(-1, 1, key="image_ctx")|'
      f'value_range(-1, 1)|'
      f'keep("image", "image_ctx", "image/id")'
  )

  config.dataset = 'coco/2017_panoptic'
  config.train_split = 'train[4096:]'
  config.val_split = 'train[:4096]'
  config.shuffle_buffer_size = 50_000
  config.batch_size = 512
  config.num_epochs = 200

  config.log_training_steps = 50
  config.log_eval_steps = 1000
  config.checkpoint_steps = 1000
  config.keep_checkpoint_steps = 5000

  # Optimizer section
  config.optax_name = 'big_vision.scale_by_adafactor'
  config.optax = dict(beta2_cap=0.95)
  config.lr = 0.001
  config.wd = 0.000001
  config.lr_mults = [
      ('pos_embedding_encoder.*', 0.1),
      ('EmbedPatches.*', 0.1),
      ('encoder.*', 0.1),
      ('decoder.*', 1.0)
  ]
  config.schedule = dict(decay_type='cosine', warmup_steps=4_000)

  # Restricted oracle section
  config.oracle = dict()
  config.oracle.task = 'panoptic_segmentation'
  config.oracle.model_init = 'oracle.npz'
  config.oracle.model_name = 'proj.uvim.vit'
  config.oracle.model = STAGE_I_MODELS['base']
  config.oracle.model.input_size = (LABEL_RES, LABEL_RES)
  config.oracle.model.patch_size = (LABEL_PATCH_SIZE, LABEL_PATCH_SIZE)
  config.oracle.model.code_len = 256
  config.oracle.model.dict_size = 4096
  config.oracle.model.codeword_dim = 768
  config.oracle.model.with_encoder_ctx = True
  config.oracle.model.with_decoder_ctx = True
  config.oracle.model.inputs = {
      # +1 for void label
      'semantics': (133 + 1, LABEL_PATCH_SIZE**2),
      # COCO: actually 98 train/78 validation.
      'instances': (100, LABEL_PATCH_SIZE**2),
  }
  config.oracle.model.outputs = config.oracle.model.inputs

  # Model section
  config.model_name = 'proj.uvim.lm'
  config.model_init = {'encoder': 'howto-i21k-L/16'}
  config.model = LM_MODELS['large']
  config.model.patches = dict(size=(PATCH_SIZE, PATCH_SIZE))
  config.model.vocab_size = config.oracle.model.get_ref('dict_size') + 1
  config.model.posemb_type = 'learn'
  config.model.input_size = (RES, RES)
  config.model.seq_len = config.oracle.model.get_ref('code_len')

  # Evaluation section
  config.evals = [
      ('panoptic_train', 'coco_panoptic'),
      ('panoptic_holdout', 'coco_panoptic'),
      ('panoptic_val', 'coco_panoptic'),
  ]
  base_eval = dict(pp=pp_predict, log_steps=10_000)

  config.panoptic_train = dict(base_eval)
  config.panoptic_train.prefix = 'coco_panoptic_train/'
  config.panoptic_train.split = 'train[4096:8192]'

  config.panoptic_holdout = dict(base_eval)
  config.panoptic_holdout.prefix = 'coco_panoptic_holdout/'
  config.panoptic_holdout.split = 'train[:4096]'

  config.panoptic_val = dict(base_eval)
  config.panoptic_val.prefix = 'coco_panoptic/'
  config.panoptic_val.split = 'validation'

  return config
Listing 2: Full config for panoptic stage II training.
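The `lr_mults` entry in Listing 2 assigns per-parameter learning-rate multipliers by name pattern, so that the pretrained encoder is updated ten times more slowly than the decoder. The sketch below shows one plausible way such a list could be interpreted: the first pattern that matches the full parameter name wins. The matching rule, the default multiplier and the parameter names are assumptions made for illustration and may not reflect how big_vision resolves these patterns.

```python
import re

LR_MULTS = [
    ("pos_embedding_encoder.*", 0.1),
    ("EmbedPatches.*", 0.1),
    ("encoder.*", 0.1),
    ("decoder.*", 1.0),
]


def lr_multiplier(param_name, lr_mults=LR_MULTS, default=1.0):
    """Return the multiplier of the first pattern matching the full name."""
    for pattern, mult in lr_mults:
        if re.fullmatch(pattern, param_name):
            return mult
    return default


BASE_LR = 0.001
# Hypothetical parameter names, for illustration only.
PARAM_NAMES = (
    "encoder/layer_0/kernel",
    "decoder/layer_0/kernel",
    "EmbedPatches/kernel",
    "head/bias",
)
for name in PARAM_NAMES:
    print(name, BASE_LR * lr_multiplier(name))
```

With these assumptions, encoder and patch-embedding parameters train at an effective learning rate of 1e-4, while decoder parameters use the full 1e-3.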