
LayoutBERT: Masked Language Layout Model for Object Insertion

by   Kerem Turgutlu, et al.

Image compositing is one of the most fundamental steps in creative workflows. It involves taking objects/parts of several images to create a new image, called a composite. Currently, this process is done manually by creating accurate masks of objects to be inserted and carefully blending them with the target scene or images, usually with the help of tools such as Photoshop or GIMP. While there have been several works on automatic selection of objects for creating masks, placing an object within an image with the correct position, scale, and harmony remains a difficult problem with limited exploration. Automatic object insertion in images or designs is difficult because it requires understanding of the scene geometry and the color harmony between objects. We propose LayoutBERT for the object insertion task. It uses a novel self-supervised masked language model objective and bidirectional multi-head self-attention. It outperforms previous layout-based likelihood models and shows favorable properties in terms of model capacity. We demonstrate the effectiveness of our approach for object insertion in the image compositing setting and other settings like documents and design templates. We further demonstrate the usefulness of the learned representations for layout-based retrieval tasks. We provide both qualitative and quantitative evaluations on datasets from diverse domains like COCO, PublayNet, and two new datasets which we call Image Layouts and Template Layouts. Image Layouts, which consists of 5.8 million images with layout annotations, is the largest image layout dataset to our knowledge. We also share ablation study results on the effect of dataset size, model size and class sample size for this task.





1 Introduction

With the recent rise in image and video creation for teaching, advertisement, information sharing, and social influencing, the need for AI-based assistance in image and video editing is greater than ever. Image compositing is one of the most common tasks in creative workflows. However, currently, it involves several manual steps such as background and foreground selection, masking, refinement, placement, scale-adjustment, and harmonization. Due to this tedious, multi-step process, it is difficult for creatives to try more than a few new design ideas. Moreover, the learning curve for beginners is quite steep, making it inaccessible for the majority of users who are interested in expressing themselves in a creative way.

Figure 1: Iterative class conditional compositing using LayoutBERT bounding box predictions and alpha masking. At each step, object insertion orders are re-sorted based on bottom bounding box coordinates to avoid unrealistic occlusion.

The success of self-attention and transformer networks on several key language and vision learning tasks has led to several new explorations of these architectures including layout understanding and generation. In this work, we further push the state-of-the-art on layout understanding and show very promising results on realistic images as well as on documents and templates. We envision a future where layout input from the user is the exception rather than the norm.

Figure 2: Masked input sequences during LayoutBERT training. Random left-right flip is applied during training as data augmentation. The 2D layout is then converted into the input sequence for modeling. A bounding box is selected for masking with uniform sampling, and each token of the selected bounding box is masked iteratively and added to the batch. For each added sample, the model predicts the left-most masked token (denoted by pink coloring).

Usually, creators/designers start with a blank canvas, initialize their work with a base or background image and bring in parts from multiple images while applying geometric and color transformations (edits) until the desired creation is achieved. The final creation can be a personal family collage, an advertisement photo, a sci-fi movie poster, a petting zoo fundraiser flyer, etc.

There are many existing works on automated color transformation learning, such as deep image harmonization [24, 19]. However, these works do not address geometric transformation learning for image compositing, and the existing work that does often limits the problem to certain classes or less complex datasets.

In this paper we propose a bidirectional likelihood based model which can learn the most likely location and scale for a given object to be inserted into an image. Our approach offers a solution at scale to automate both photo-realistic and template-like object insertion conditioned on the desired class using BERT [5] with a custom self-supervised optimization objective. We also provide layout based retrieval results using the representations learned from the self-supervised training. Our main contributions are:

  • A novel self-supervised masked language model LayoutBERT for layout understanding,

  • Application to object insertion and layout retrieval using LayoutBERT,

  • State-of-the-art results on multiple datasets and ablation studies on the largest known image layout dataset.

2 Related Work

There are several studies in the literature on deep image compositing [17, 15] which use generative adversarial networks (GANs) and specialized networks such as spatial transformer networks (STNs) [9]. [17] trains sequential STN generators: it iteratively applies geometric transformations (image warping) on an initially placed foreground image and tries to generate a realistic final composite image. [15] proposes two network modules: (1) a where module and (2) a what module. The where module generates realistic bounding boxes by transforming a unit bounding box with an STN, and the what module uses that bounding box to generate a semantic map for the desired class instance while optimizing to make the final semantic map realistic. A separate model is trained for each class, e.g., a pedestrian model and a car model in the case of the Cityscapes dataset [4]. Unfortunately, neither of these approaches can be scaled to larger models and to many classes.

Although GANs are proven to work very well on realistic image generation with high fidelity, they are notoriously difficult to train and are not easily scalable. Self-attention and transformer networks have been successfully used to train billion-parameter models and even a trillion-parameter language model [23]. Recently, multi-head self-attention has been used to train models for layout generation and completion [8, 1]. An image can be represented as a set of layout elements by extracting class and bounding box information of the overall scene and of the objects in it. Similarly, a text document can be represented by the bounding box information of different elements such as titles, texts, graphs, and figures. The same idea can be extended to any creative design like posters, flyers, invitation cards, etc. In this regard, [8] trains an autoregressive model using a causal attention mask and maximizes log-likelihood via a next token prediction task similar to [22] in a self-supervised manner. [1] extends this idea by proposing a variational autoencoder (VAE) to learn better representations using a BERT [5] like encoder and a GPT [22] like decoder. Layout transformer models use only the class and bounding box information extracted from the image pixel data. This can be considered a disadvantage compared to CNN-based models in terms of the granularity of information, but the same property allows more efficient training and inference, and hence scalability.

Although previous layout transformer models can model the data likelihood and the distribution, they are not optimal for the object insertion task. [8] can generate or complete layouts, but it can only attend to the left context and cannot see the whole scene at once, which is a major drawback for the object insertion task. [1] can learn better representations and improve generation diversity, but that usually comes at the cost of likelihood. [15] directly learns where to generate bounding boxes for object insertion with its where module; however, it requires training a separate model for every class. Also, a GAN objective makes it difficult to use ’too powerful’ discriminators or generators. We argue that the object insertion task requires seeing the whole scene at once, modeling long range dependencies in the scene with a module like self-attention, and large scale models. Our work combines and extends ideas from prior art and can be considered similar to the where module of [15] in terms of the task and to [8] in terms of the input representation and multi-head self-attention mechanism. Layout understanding is similar to scene understanding, and using bidirectional context is a natural choice for learning realistic layouts.

3 LayoutBERT

Our method treats image, document or template layouts as scene graphs and tries to solve the object insertion task using a masked language modeling objective. For this purpose we use a bidirectional transformer model like BERT due to its popularity and name our method accordingly. That being said, our custom masked language modeling objective for layout understanding can be used with any of the transformer models like [14, 25, 3, 2] and even with a bidirectional LSTM, GRU, or RNN. The original BERT model for NLP language modeling is trained using two tasks: masked LM and next sentence prediction (NSP). However, neither of these tasks is suitable for layout understanding and object insertion. The masked LM objective picks individual tokens to be masked during training; in contrast, we iteratively mask a span of tokens which corresponds to a bounding box. NSP is used for classifying whether a sentence comes after another given sentence and cannot be used for token generation in the context of the object insertion task. Also, [20] shows that removing the NSP loss matches or slightly improves downstream NLP task performance, hence it is not critical.

3.1 Layout Representation

To create input sequences for training we primarily follow the ideas from [8]. We use bounding box annotations of raw images, documents or templates depending on the dataset. When bounding box annotations are not available, and at inference time, we use a pretrained panoptic segmentation model [13] in the case of images and an object detection model in the case of documents and templates. We convert bounding boxes into a flat sequence using the raster scan order. A sequence input is then represented as: BOS, $c_1, x_1, y_1, w_1, h_1$, …, $c_n, x_n, y_n, w_n, h_n$, EOS, where c, x, y, w, h stand for the class token, top-left x coordinate, top-left y coordinate, width and height tokens respectively. BOS and EOS are special tokens: beginning and end of sentence. During tokenization each class id is mapped to a unique class token, and x, y, w, h tokens are mapped to a discrete space by splitting the 2D input into a grid similar to anchors from the object detection literature.
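The serialization described above can be illustrated with a minimal sketch. The grid size of 64, the token spellings (`c3`, `x12`, ...), the normalized input coordinates, and the reading of "raster scan order" as sorting by top-left y and then x are all assumptions made for illustration, not details confirmed by the text.

```python
from typing import List, Tuple

# (class_id, x, y, w, h) with coordinates normalized to [0, 1] (assumed input format)
Box = Tuple[int, float, float, float, float]

GRID = 64  # assumed grid resolution for discretizing coordinates
BOS, EOS = "BOS", "EOS"


def quantize(value: float, grid: int = GRID) -> int:
    """Map a normalized value in [0, 1] to a grid bin in [0, grid - 1]."""
    return min(grid - 1, max(0, int(value * grid)))


def layout_to_sequence(boxes: List[Box], grid: int = GRID) -> List[str]:
    """Flatten a 2D layout into BOS, c, x, y, w, h, ..., EOS."""
    # Assumed raster scan order: top-to-bottom, then left-to-right by top-left corner.
    ordered = sorted(boxes, key=lambda b: (quantize(b[2], grid), quantize(b[1], grid)))
    seq = [BOS]
    for cls, x, y, w, h in ordered:
        seq += [
            f"c{cls}",
            f"x{quantize(x, grid)}",
            f"y{quantize(y, grid)}",
            f"w{quantize(w, grid)}",
            f"h{quantize(h, grid)}",
        ]
    seq.append(EOS)
    return seq
```

Each box contributes exactly five tokens, so a layout with n boxes yields a sequence of length 5n + 2.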

3.2 Model Architecture and Training

We use the BERT architecture introduced in [5] and optimize it using a novel self-supervised training objective for layout understanding. For the object insertion task it is important to look at the whole context at once when generating bounding boxes. This is where bidirectional attention comes into play. During training we randomly select a bounding box and mask the 5 tokens c, x, y, w, h which represent it. We do this by creating 5 duplicates of each sequence sample, masking all 5 tokens in the first duplicate and iteratively unmasking the left-most token at each step, as shown in Figure 2. For each masked sequence we try to predict the left-most masked token. This custom masked language modeling objective allows our model to generate bounding boxes by predicting c, x, y, w and h step by step, mimicking autoregressive likelihood models in the context of a single bounding box generation while being able to attend to all the other bounding boxes with bidirectional attention. During training we also apply random left-right flip as data augmentation on the 2D layout before converting it into a flat input sequence.
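A minimal sketch of how the five masked duplicates could be built for one training sequence is given below. The function name `make_masked_samples`, the `[MASK]` token spelling, and the returned tuple layout are hypothetical choices for illustration; only the masking pattern itself follows the description above.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"
TOKENS_PER_BOX = 5  # c, x, y, w, h


def make_masked_samples(
    seq: List[str], rng: Optional[random.Random] = None
) -> List[Tuple[List[str], int, str]]:
    """Return 5 samples of (masked sequence, target position, target token).

    `seq` is expected to look like BOS, c, x, y, w, h, ..., EOS.
    """
    rng = rng or random.Random()
    num_boxes = (len(seq) - 2) // TOKENS_PER_BOX
    box_idx = rng.randrange(num_boxes)      # uniform choice of the box to mask
    start = 1 + box_idx * TOKENS_PER_BOX    # position of its class token

    samples = []
    for step in range(TOKENS_PER_BOX):
        masked = list(seq)
        # At step 0 all five tokens are masked; each later step reveals one more
        # token from the left, so the target is always the left-most masked token.
        for pos in range(start + step, start + TOKENS_PER_BOX):
            masked[pos] = MASK
        samples.append((masked, start + step, seq[start + step]))
    return samples
```

Because every duplicate keeps the tokens of all other boxes visible, the prediction for each masked token can use context on both sides of the selected box.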

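Instead of using teacher forcing with a decoder-only model like [8], we use a BERT architecture with bidirectional attention and model the joint distribution as:

$$p(b_i \mid b_{\neq i}) = \prod_{k=1}^{5} p\left(b_i^{k} \mid b_i^{1}, \dots, b_i^{k-1}, b_{\neq i}\right)$$

where $b_i^{k}$ is the $k$-th element of the $i$-th box and $b_{\neq i}$ denotes all other boxes in the layout. For example, $k=1$ for $c$, $k=2$ for $x$, $k=3$ for $y$, $k=4$ for $w$, and $k=5$ for $h$.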

4 Experiments

In our evaluations, we refer to [8] as LayoutTransformer, our re-implementation of [8] using GPT [22] model as LayoutGPT and all other methods with their standard names.

4.1 Datasets

We evaluate our results on diverse datasets with natural scenes, documents, and design templates. We use a separate hold-out set for training and validate all models using the official validation splits for each dataset.

COCO [18] is a natural scene dataset with common objects which contains both object class and ’stuff’ class annotations. The object class contains a predefined set of 80 common objects while the stuff class contains 92 non-object annotations like sky, wall, grass and pavement. Stuff annotations are as important as object annotations for understanding the overall scene and for the object insertion task. We used COCO Panoptic 2017 dataset and followed the same preparation steps as [8]. This resulted in 118,280 layouts in the training split and 5,000 layouts in the validation split. The other class in the stuff annotations is ignored, resulting in final 80 thing and 91 stuff classes.

PublayNet [27] is a public large scale dataset for document layout understanding. It has 5 categories: text, title, figure, list and table. As with COCO, we followed [8] for the data preparation steps, including removing layouts with more than 128 elements. This resulted in 335,682 and 11,245 document layouts for the training and validation splits respectively.

Image Layouts is a large scale image dataset with 5.8 million images crawled from the web. Manually annotating such a large dataset is expensive and labor intensive. For that reason we used a pretrained panoptic segmentation model [13] available at [26] to generate stuff and object class bounding box annotations. It has a total of 133 stuff and object classes.

Template Layouts is a dataset with creative design templates such as posters, flyers, collages, social media posts, ads and more. It has a total of 45k templates and 2 classes: image and text.

The Image Layouts and Template Layouts datasets are not publicly available, and are curated for experimentation on large scale layout understanding in diverse domains.

4.2 Object Insertion

Our custom masked language model objective allows bounding box generation given a layout sequence. This is useful for the object insertion task since our model can predict the most likely class, location and scale to be inserted. LayoutBERT is designed with the object insertion task in mind and can attend to all the bounding boxes in the scene at once. In contrast, previous work [8] is trained using a decoder-only autoregressive transformer model for generation and can only attend to the left context. During our qualitative analysis we used our own implementation of [8], LayoutGPT, since there was no open-source code or model available for conducting our experiments. Our re-implementation outperforms the results reported in the original paper.

Object Class Recommendation. We can make class recommendations about which foreground objects are more likely to be inserted into a given image, document or template. For this purpose, we feed the input sequence to the model and get the output probabilities at each token which give us the most likely predictions for the next token. In the case of the LayoutBERT model, we insert 5 sequential masked tokens in every possible sequence location and get the output probabilities for the masked class token.

For a simple sequence like BOS, c, x, y, w, h, EOS with only 1 bounding box, all possible mask insertions can be seen as:

  • Predict at position 1: BOS, [MASK], [MASK], [MASK], [MASK], [MASK], c, x, y, w, h, EOS

  • Predict at position 2: BOS, c, x, y, w, h, [MASK], [MASK], [MASK], [MASK], [MASK], EOS

where we get output probabilities for the masked class token, i.e., the first [MASK] of each inserted span. Using these probabilities, we identify the most likely classes that can be inserted after a given partial sequence and later use them for class conditional bounding box generation. We provide top-1 accuracy results for class recommendations in Table 1. The bidirectional masked language objective allows LayoutBERT to learn correct object classes in all three datasets, showing a significant improvement over the previous approach.
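The mask insertions listed above can be enumerated programmatically. The sketch below is illustrative only: `predict_fn` is a hypothetical stand-in for a trained model that returns a class-to-probability mapping for a masked position, and ranking candidates by their raw probability across insertion slots is an assumption, since the text does not say how slots are combined.

```python
from typing import Callable, Dict, List, Tuple

MASK = "[MASK]"
TOKENS_PER_BOX = 5


def mask_insertions(seq: List[str]) -> List[Tuple[int, List[str], int]]:
    """Return (slot number, masked sequence, class-token position) per slot."""
    num_boxes = (len(seq) - 2) // TOKENS_PER_BOX
    candidates = []
    for slot in range(num_boxes + 1):
        pos = 1 + slot * TOKENS_PER_BOX  # where the masked class token will sit
        masked = seq[:pos] + [MASK] * TOKENS_PER_BOX + seq[pos:]
        candidates.append((slot + 1, masked, pos))
    return candidates


def recommend_classes(
    seq: List[str],
    predict_fn: Callable[[List[str], int], Dict[str, float]],  # hypothetical model call
    top_k: int = 1,
) -> List[Tuple[str, int, float]]:
    """Rank (class, slot, probability) triples over all insertion slots."""
    scored = []
    for slot, masked, pos in mask_insertions(seq):
        for cls, prob in predict_fn(masked, pos).items():
            scored.append((cls, slot, prob))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]
```

A layout with n boxes produces n + 1 insertion slots, so the number of forward passes grows linearly with the number of existing boxes.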

| Model | COCO | PublayNet | Template Lay. |
|---|---|---|---|
| LayoutGPT | 0.30 | 0.80 | 0.78 |
| Ours | 0.44 | 0.95 | 0.86 |

Table 1: Top-1 accuracy class recommendation on COCO, PublayNet and Template Layouts. Higher is better.

Bounding Box Generation. When the target class to be inserted is known, either provided or recommended by the model, we can use the model probability outputs at each token to identify the most likely sequence of locations to insert the class token for bounding box generation. For generation with the LayoutGPT model, we use beam search with top-k and top-p sampling with values k=15 and p=0.9. For generation with LayoutBERT model, we use top-k sampling where k=3. These values can be modified to control the level of diversity in bounding box generation. Since LayoutGPT uses a causal mask, it only attends to the previous tokens and is not able to incorporate bidirectional context like LayoutBERT does. To incorporate bidirectional context to LayoutGPT model during generation, we apply left-right flip as a test time augmentation (TTA). For a simple sequence like BOS, c, x, y, w, h, EOS with only 1 bounding box, class conditional iterative bounding box generation at index position 1 can be seen as:

  • Predict x: BOS, c, [MASK], [MASK], [MASK], [MASK], c, x, y, w, h, EOS

  • Predict y: BOS, c, x, [MASK], [MASK], [MASK], c, x, y, w, h, EOS

  • Predict w: BOS, c, x, y, [MASK], [MASK], c, x, y, w, h, EOS

  • Predict h: BOS, c, x, y, w, [MASK], c, x, y, w, h, EOS
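The iterative procedure in the list above can be sketched as follows. Here `predict_fn` is a hypothetical stand-in for the trained model, returning a token-to-probability mapping for one masked position, and scoring a box by the product of its sampled token probabilities is an assumption made for illustration.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

MASK = "[MASK]"


def top_k_sample(probs: Dict[str, float], k: int, rng: random.Random) -> str:
    """Sample one token among the k most probable, weighted by probability."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [prob for _, prob in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate_box(
    seq: List[str],
    slot_pos: int,
    class_token: str,
    predict_fn: Callable[[List[str], int], Dict[str, float]],  # hypothetical model call
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], float]:
    """Insert `class_token` at `slot_pos` and fill x, y, w, h left to right."""
    rng = rng or random.Random()
    work = seq[:slot_pos] + [class_token] + [MASK] * 4 + seq[slot_pos:]
    score = 1.0  # assumed scoring: product of the sampled token probabilities
    for offset in range(1, 5):  # x, y, w, h in that order
        pos = slot_pos + offset
        probs = predict_fn(work, pos)
        token = top_k_sample(probs, k, rng)
        work[pos] = token
        score *= probs[token]
    return work[slot_pos:slot_pos + 5], score
```

Larger values of `k` increase the diversity of the sampled boxes, while `k=1` reduces the procedure to greedy decoding.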

We provide mean intersection over union (mIoU) results on class conditional bounding box generations in Table 2. It is calculated by averaging the IoU between each predicted bounding box and its ground truth bounding box for a given sequence.
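For reference, a minimal sketch of the IoU and mIoU computation for boxes in (x, y, w, h) format with a top-left origin is shown below. Pairing predictions and ground truths by index is an assumption about how the evaluation is organized.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left origin


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Box], targets: List[Box]) -> float:
    """Average IoU over index-aligned (prediction, ground truth) pairs."""
    pairs = list(zip(preds, targets))
    if not pairs:
        return 0.0
    return sum(iou(p, t) for p, t in pairs) / len(pairs)
```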

| Model | COCO | PublayNet | Template Lay. |
|---|---|---|---|
| LayoutGPT | 0.31 | 0.52 | 0.21 |
| Ours | 0.36 | 0.78 | 0.27 |

Table 2: Class conditional bounding box generation mIoU on COCO, PublayNet and Template Layouts. Higher is better.

Scoring and Post-Processing. Each bounding box generation has an associated output probability. This allows us to score any predicted bounding box x, y, w, h and use this score for ranking. We apply non-maximum suppression (NMS) on our bounding box generations similar to popular object detection models [6]. This allows us to remove bounding boxes with lower scores that overlap too much. The NMS threshold is another controllable parameter like top-k and top-p. After top scoring bounding boxes are obtained we can use alpha composition to insert the object into the image. An example of such compositing is given in Figure 1 where the same steps are applied to place 3 new objects in the original image.
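The NMS step can be sketched as a greedy filter over scored boxes. This is a generic illustration of non-maximum suppression, not the exact post-processing used here; the default threshold of 0.5 and the (x, y, w, h) box format are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left origin


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Lowering `iou_threshold` suppresses more aggressively and yields fewer, more spread-out candidate placements.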

Quantitative Results. Negative log-likelihood (NLL) is a common metric used in previous works for the layout generation task. It is also a good choice for assessing object insertion performance since it can be considered a proxy for class recommendation and bounding box generation accuracy. We outperform the current state-of-the-art model [8] across all datasets in terms of the NLL metric, as shown in Table 3. We also compare LayoutBERT to prior art on COCO in Table 4. It can be seen that class (Table 1) and bounding box (Table 2) prediction performance correlates with NLL (Table 3).

| Dataset | LayoutTransformer [8] | LayoutGPT | Ours |
|---|---|---|---|
| COCO [18] | 2.67 | 2.57 | 2.32 |
| PublayNet [27] | 1.28 | 1.14 | 0.63 |
| Image Layouts | - | 2.26 | 2.08 |
| Template Layouts | - | 2.19 | 1.98 |

Table 3: NLL results on COCO, PublayNet, Image Layouts and Template Layouts datasets using 64x64 anchor resolution.

| Model | NLL |
|---|---|
| LayoutVAE [11] | 3.29 |
| ObjGAN [16] | 5.24 |
| sg2im [10] | 3.4 |
| LayoutTransformer [8] | 2.28 |
| Ours | 1.91 |

Table 4: NLL results on COCO using 32x32 anchor resolution. Lower is better.

Qualitative Results. During qualitative analysis we use the steps described in Section 4.2 and generate composites to be visualized. We pick random samples from the validation set of each dataset, identify the top-k classes to be inserted into each sample and then conditionally generate the bounding boxes. In Figure 3 and Figure 4 we insert the most likely bounding box for the top-1 predicted class on the PublayNet and Template Layouts datasets respectively, and visualize samples side-by-side before and after the object insertion. In Figure 5, we identify the top-5 classes for each sample from the COCO dataset and generate class conditional bounding boxes with top-k sampling to show diverse results.

Figure 3: Side-by-side view of before and after top-1 class bounding box generation on PublayNet validation documents as described in Section 4.2. Purple: text, yellow: table, white: title, cyan: figure, blue: list. Best viewed in color.

Figure 4: Side-by-side view of before and after top-1 class bounding box generation on Template Layouts validation templates as described in Section 4.2. Yellow: text, green: image. Best viewed in color.

Figure 5: Class recommendations and bounding box generations conditioned on top-5 classes using COCO validation images as described in Section 4.2. Predicted classes are sorted from left to right by their associated probability scores. We use k=3 for top-k sampling during bounding box generation. Lighter bounding box color means lower ranking score. Best viewed in color.

4.3 Layout Retrieval

Each GPT [22] feature at a given layer is calculated by attending only to the previous tokens from the previous layer. For the LayoutGPT model, we tried two ways of generating a representation for a given layout: the last feature of the last hidden state and the average of all features of the last hidden state. During the qualitative comparison, the average of the last hidden state gave better results, so we use it for extracting the representations used in the quantitative evaluation. BERT [5] uses bidirectional self-attention, so every feature can attend to every other feature from the previous layer. We therefore also use the average of the last hidden state to extract representations for the LayoutBERT model.

Quantitative Results. During quantitative analysis, mean average precision at k is used since it is a common choice for retrieval tasks and is widely used for assessing ranked recommendations. None of the datasets used during training were designed with a retrieval-by-layout task in mind, so they do not have relevancy information for a given pair of images, documents or layout templates. We conducted an external assessment job using a data annotation platform which has experienced taskers to manually assess the performance of the models. We ran jobs for the COCO, PublayNet and Template Layouts datasets and compared the retrieval performance of the LayoutGPT and LayoutBERT models. During the COCO evaluation, actual images are shown to taskers, and during the PublayNet and Template Layouts evaluations, rendered layouts with color-coded bounding boxes are shown to taskers.

We use cosine similarity for retrieving similar layouts and report our results on mAP@5. For a given query the top 5 retrieved layouts are shown for assessment. For each dataset, we use the official validation set for retrieval evaluation. We use 1k random samples as the query set and the remaining as the recall set. Each query is shown to 3 different experienced taskers for final metric calculation, which we report in Table 5. Final mAP@5 is calculated by taking the weighted average using tasker trust scores:

$$\text{mAP@5} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{3} t_j \, \text{AP@5}_{i,j}$$

where $N$ is the total number of queries, $\text{AP@5}_{i,j}$ is the average precision at 5 for query $i$ calculated by tasker $j$, and $t_j$ is the trust score for tasker $j$ normalized to 1. Ease of Job is rated by taskers on a scale of 1 to 5.
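The trust-weighted averaging described above can be sketched as follows. The exact AP@5 definition is not given in the text, so this sketch assumes the common form that averages precision at each relevant rank within the top 5; the input layout (per query, a list of `(trust_score, relevance_list)` pairs) and the per-query normalization of trust scores are likewise assumptions.

```python
from typing import List, Sequence, Tuple

# One tasker's judgment for one query: (trust score, relevance flags of the top-k results)
Judgment = Tuple[float, Sequence[int]]


def average_precision_at_k(relevance: Sequence[int], k: int = 5) -> float:
    """Mean of precision@rank over the relevant ranks within the top k."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def weighted_map_at_k(judgments: List[List[Judgment]], k: int = 5) -> float:
    """Average over queries of the trust-weighted AP@k across taskers."""
    total = 0.0
    for per_query in judgments:
        trust_sum = sum(trust for trust, _ in per_query)
        # Normalize trust scores so that they sum to 1 for each query.
        total += sum(
            (trust / trust_sum) * average_precision_at_k(rel, k)
            for trust, rel in per_query
        )
    return total / len(judgments)
```

With equal trust scores this reduces to the plain mean of AP@5 over taskers and queries.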

| Dataset | LayoutGPT | Ours | Ease of Job |
|---|---|---|---|
| COCO [18] | 0.51 | 0.50 | 3.0/5 |
| PublayNet [27] | 0.29 | 0.33 | 3.6/5 |
| Template Lay. | 0.46 | 0.47 | 2.5/5 |

Table 5: Retrieval results mAP@5.

Qualitative Results. During qualitative analysis we use the representations extracted from both models, retrieve the top neighbors based on cosine similarity and visualize them side-by-side. Again, we use actual images for COCO and rendered layouts for the PublayNet and Template Layouts datasets. An example of retrieval by layout can be seen in Figure 6.
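The retrieval pipeline in this section (average the last hidden state, then rank by cosine similarity) can be sketched in plain Python. The hidden states here are ordinary nested lists standing in for model outputs, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token features of the last hidden state into one vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: Sequence[float], gallery: Sequence[Sequence[float]], top_k: int = 5) -> List[int]:
    """Return indices of the top_k gallery vectors most similar to the query."""
    order = sorted(
        range(len(gallery)),
        key=lambda i: cosine_similarity(query, gallery[i]),
        reverse=True,
    )
    return order[:top_k]
```

In practice the gallery vectors would be computed once for the recall set and reused across queries.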

Figure 6: Retrieval results on COCO, PublayNet and Template Layouts using LayoutGPT and LayoutBERT. Left-most image is the query and the others are top-k neighbors. We show (a,b) LayoutGPT and LayoutBERT on COCO, (c,d) LayoutGPT and LayoutBERT on PublayNet, (e,f) LayoutGPT and LayoutBERT on Template Layouts. Best viewed in color.

5 Scaling Studies

Here we study the effect of the dataset, model and class sample size. We report ablation study results using our large scale Image Layouts dataset.

Training Details. Our small model consists of , , , and , medium model consists of , , , and , and large model consists of , , , and . We also use a dropout of 0.1 at the end of each feed-forward layer for regularization and GELU activation. We use Adam optimizer [12] with decoupled weight decay [21] with an initial learning rate of using cosine annealing starting after completing 0.75 of training.

5.1 Effect of Dataset and Model Size

We randomly sub-sampled the training data into 20%, 60%, and 100% chunks, where 100% corresponds to 5.8 million layouts from the Image Layouts dataset. Each model is trained with an equal number of forward passes and backward updates, using the same training schedule for fair comparison. Models with 100% of the training data are trained for 1.2 epochs, 60% of the training data are trained for 3 epochs and 20% of the training data are trained for 6 epochs. We plot the results in Figure 7, which shows that small and medium sized models are not able to benefit from an increased number of samples as well as the large sized model. Also, the large model outperforms its smaller counterparts overall. This supports our motivation for creating a large-scale dataset for this task and increasing the model capacity with it. [8] mentions that they do not observe significant improvements beyond a 6-layer transformer model. However, that is not the case with our large scale dataset, where LayoutBERT-large shows a 3% improvement.

Figure 7: NLL plots of different model scales and sample sizes on Image Layouts dataset.

5.2 Effect of Class Sample Size

We study the performance of each class in our large scale Image Layouts dataset by plotting the NLL per class. There is a positive correlation between class sample size and performance as seen in Figure 8. Common stuff classes like sky, wall, sea, tree, grass and object classes like person have low error rates, while rare classes like toaster and parking meter have much higher error rates.

Figure 8: Class sample size vs. NLL.

6 Conclusion

Recent works on context-aware object synthesis and layout generation motivated us to find a scalable solution for the object insertion task. Our contributions allow modeling the likelihood of hundreds of classes that can be inserted into a given scene by using a novel self-supervised masked language modeling objective and a bidirectional transformer model. Our method pushes the state-of-the-art further on a diverse set of data domains like complex scenes, documents and design templates. We also show that deep bidirectional architectures achieve better results as training data size increases. One drawback of our approach is the limited input representation. In the case of the scene datasets, we lose information by transforming the input signal from the RGB domain to a 2D layout domain by limiting inputs to bounding boxes. In the future, we will work on methods which can directly take the raw input data yet leverage the power of transformer models.