BLT: Bidirectional Layout Transformer for Controllable Layout Generation

by   Xiang Kong, et al.
Carnegie Mellon University

Creating visual layouts is an important step in graphic design. Automatic generation of such layouts is important as we seek scalable and diverse visual designs. Prior works on automatic layout generation focus on unconditional generation, in which the models generate layouts while neglecting user needs for specific problems. To advance conditional layout generation, we introduce BLT, a bidirectional layout transformer. BLT differs from autoregressive decoding in that it first generates a draft layout that satisfies the user inputs and then refines the layout iteratively. We verify the proposed model on multiple benchmarks with various fidelity metrics. Our results demonstrate two key advances over state-of-the-art layout transformer models. First, our model empowers layout transformers to fulfill controllable layout generation. Second, our model reduces the linear inference-time complexity of autoregressive decoding to a constant, thereby achieving 4x-10x speedups in generating a layout at inference time.




1 Introduction

Creating visual layouts is important for any design task, ranging from documents, adverts, and posters to furniture layout. Layout dictates the placement and sizing of graphic components, playing a central role in how viewers interact with the information provided [lee2020neural].

(a) Conditional (top) and unconditional (bottom) generation.
(b) Unidirectional and bidirectional self-attention.
Figure 1: Comparison of conditional and unconditional layout generation. Each object is modeled by 5 attributes: ‘category’, ‘X’, ‘Y’, ‘W’ (width) and ‘H’ (height). In conditional generation, attributes are partially given by the user and the goal is to generate the unknown attributes, e.g. putting the icon or button on the canvas. In contrast, unconditional generation produces layouts that do not respect the user input.

Graphic layout generation is emerging as a new research direction for generating realistic and diverse layouts to facilitate design tasks. Recent works show promising methods of layout generation for applications such as graphic user interfaces [deka2017rico], presentation slides [guo2021layout], magazines [zheng2019content], scientific publications [arroyo2021variational], commercial advertisements [lee2020neural, qian2020retrieve], Computer-Aided Design (CAD) [willis2021engineering], indoor graphics scenes [di2021multi], etc.

Recent work explores neural models for layout generation using Generative Adversarial Networks (GANs) [goodfellow2014generative, li2019layoutgan] and Variational Autoencoders (VAEs) [kingma2013auto, jyothi2019layoutvae, patil2020read, lee2020neural]. State-of-the-art layout generation models are built on the Transformer architecture [vaswani2017attention]. These transformers [gupta2020layout, arroyo2021variational] represent a layout as a sequence of objects and an object as a (sub)sequence of attributes (see Fig. 1(a)). Layout transformers predict the attributes sequentially based on previously generated output (i.e. autoregressive decoding). By virtue of the powerful self-attention, transformers are capable of modeling complex and long-range object relations and, compared to GANs or VAEs, achieve superior quality and diversity on common layout benchmarks.

However, existing transformers only tackle unconditional layout generation, where layout elements are generated from random seeds without considering specific user requirements (see footnote 1). In this process, users have no control over which object is generated or how big it is. This is in direct contrast to the real-world scenario, where the designer may already have objects with partially known attributes and hopes to generate the missing attributes. As shown in Fig. 1(a), the user wants to place the icon and button with known sizes onto the canvas. This is a different setting from unconditional generation, which pays no attention to the actual object category or size at hand.

Footnote 1: Although [gupta2020layout] showed layout completion, we consider it uncontrollable generation because it gives users little control over the object attributes to be generated. This is due to the order of primitives, an acknowledged limitation detailed in [gupta2020layout].

State-of-the-art transformers [gupta2020layout, arroyo2021variational] have difficulties in conditional generation due to the following two limitations:

  • Immutable dependency chain: Autoregressive transformers follow a pre-defined generation order of object attributes. For instance, the transformer in Fig. 1(b) must generate attributes starting from the category $c$, then $x$ and $y$, followed by $w$ and $h$. The dependency chain is immutable, i.e. it cannot be changed at decoding time. Therefore, autoregressive transformers fail to perform conditional layout generation when the condition disagrees with the pre-defined dependency. For example, it is impossible to generate an object's position conditioned on its known width $w$ in Fig. 1(b).

  • High latency in decoding: Autoregressive decoding is not parallelizable, and the decoding time quickly becomes a bottleneck for the layout with a large number of objects. This is an issue for conditional generation because the model has no control of the number of objects the user specifies. For example, it can take on average 3 seconds to decode a layout with 20 objects on a CPU.

In this work, we introduce the Bidirectional Layout Transformer (BLT) for controllable layout generation. Different from the existing transformer models [gupta2020layout, arroyo2021variational], BLT enables controllable layout generation where every attribute in the layout can be modified, with high flexibility, based on the user inputs (cf. Fig. 1(a)). During training, BLT learns to predict the masked attributes by attending to attributes in two directions (cf. Fig. 1(b)). The bidirectional attention eliminates the immutable dependency, which allows the model to fulfill conditional generation. At inference time, we propose a parallel decoding algorithm in which BLT first generates a draft layout based on the user inputs and then iteratively refines the low-confidence attributes in the layout. Compared to autoregressive decoding, it has constant time complexity and hence reduces the decoding latency.

We evaluate the proposed method on six layout datasets under various metrics to analyze the visual quality. These datasets cover representative design applications for graphic user interface [deka2017rico], magazines [zheng2019content] and publications [zhong2019publaynet], commercial ads [lee2020neural], natural scenes [lin2014microsoft] and home decoration [fu20203dfront]. Experiments demonstrate two key benefits to the state-of-the-art layout transformer models [gupta2020layout, arroyo2021variational]. First, our model empowers transformers to fulfill controllable layout generation. Even though our model is not designed for unconditional layout generation, it achieves quality on-par with the state-of-the-art. Second, our new method reduces the linear inference time complexity in [gupta2020layout, arroyo2021variational] to a new constant complexity, thereby achieving 4x-10x speedups in layout generation. To summarize, we make the following contributions:

  1. A novel method that empowers transformer models to carry out conditional and controllable layout generation.

  2. A parallel decoding algorithm that reduces the time complexity of layout transformers from linear to constant.

  3. Extensive experiments validate that our method performs favorably against state-of-the-art models in terms of realism, alignment, and semantic relevance on six layout datasets.

2 Related Work

Layout Synthesis:

Recently, automatic generation of high-quality and realistic layouts has attracted increasing interest. Data-driven methods rely on deep generative models such as GANs [goodfellow2014generative] and VAEs [kingma2013auto]. For example, LayoutGAN [li2019layoutgan] uses a GAN-based framework to synthesize semantic and geometric properties for scene elements. At inference time, LayoutGAN generates layouts from Gaussian noise. LayoutVAE [jyothi2019layoutvae] introduces two conditional VAEs. The first aims to learn the distribution of category counts, which is used during layout generation. The second produces layouts conditioned on the number and category of objects generated by the first VAE or taken from reference data. Due to limited model capacity, it performs worse than the self-attention-based models. READ [patil2020read] uses heuristics to model the relationships and trains an RNN-based VAE to generate document layouts. Neural Design Networks (NDN) [lee2020neural] is a state-of-the-art VAE-based model for conditional layout generation, which focuses on modeling the asset relations and constraints by graph convolution. Our work differs from NDN in modeling the layout and user inputs with a transformer, which, as shown in Table 5, performs more favorably thanks to the transformer architecture.

Currently, the state-of-the-art for layout generation is held by the transformer models [vaswani2017attention]. In particular, [gupta2020layout] employs the standard autoregressive Transformer decoder with unidirectional attention. They find out that self-attention is able to explicitly learn relationships between objects in the layout, resulting in superior performance compared to prior works. Furthermore, to increase the diversity of generated layout, [arroyo2021variational] incorporates the standard autoregressive Transformer decoder into a VAE framework and [nguyen2021diverse] employs multi-choice prediction and winner-takes-all loss. Despite the superior performance, these models have difficulties in conditional layout generation. Similar to [lee2020neural], [kikuchi2021constrained] allows designers to add constraints through latent optimization.

Bidirectional Transformer:

The classic Transformer [vaswani2017attention] decoder uses a unidirectional self-attention mechanism to generate the sequence token by token from left to right, leaving the right-to-left context unexploited. Recent work has started to investigate generation tasks with bidirectional Transformers, which allow representations to attend in both directions [devlin2019bert]. Our work is inspired by their success in generative tasks such as language generation [gu2017non] and text-to-speech generation [donahue2020end]. Our novelty lies in the proposed masking strategy and decoding algorithm which, as substantiated by our experiments, are essential for layout generation.

3 Problem Formulation

Following [gupta2020layout], we use 5 attributes to describe an object, i.e., $(c, x, y, w, h)$, in which the first element $c$ is the object category, such as logo or button, and the remainder details the bounding box, i.e. the center location $(x, y)$ and the width and height $(w, h)$. Furthermore, float values in the bounding box are discretized using 8-bit uniform quantization. For instance, the $x$-coordinate after quantization becomes an integer $x \in \{0, 1, \dots, 255\}$. A layout $l$ of $K$ assets is hence denoted as a flattened sequence of integer indices:

$$l = [\langle\text{bos}\rangle, c_1, x_1, y_1, w_1, h_1, \dots, c_K, x_K, y_K, w_K, h_K, \langle\text{eos}\rangle]$$

where $\langle\text{bos}\rangle$ and $\langle\text{eos}\rangle$ are special tokens that denote the start and the end of the sequence. We use a shared vocabulary and represent each element in $l$ as an integer index or, equivalently, as a one-hot vector with the same length. For simplicity, the notation uses five attributes to represent an object, but we explore higher-dimensional attributes to model 3D or complex layouts in our experiments.
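The flattening described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the vocabulary layout (category ids first, then 256 coordinate bins, then the two special tokens), the function names, and the assumption that coordinates are normalized to [0, 1] are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 256  # 8-bit uniform quantization

def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a float in [0, 1] to an integer bin in [0, num_bins - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * num_bins), num_bins - 1)

def flatten_layout(
    objects: Sequence[Tuple[int, float, float, float, float]],
    num_categories: int,
) -> List[int]:
    """Turn (category, x, y, w, h) objects into one integer sequence.

    Assumed (hypothetical) shared vocabulary:
      [0, num_categories)                      category ids
      [num_categories, num_categories + 256)   quantized coordinates
      num_categories + 256                     start-of-sequence token
      num_categories + 257                     end-of-sequence token
    """
    coord_offset = num_categories
    bos = num_categories + NUM_BINS
    eos = bos + 1
    sequence = [bos]
    for category, x, y, w, h in objects:
        sequence.append(category)
        sequence.extend(coord_offset + quantize(v) for v in (x, y, w, h))
    sequence.append(eos)
    return sequence

# One object of category 2, centered on the canvas, a quarter of its size.
example = flatten_layout([(2, 0.5, 0.5, 0.25, 0.25)], num_categories=5)
```

A layout with K objects therefore yields 5K + 2 tokens, which is the sequence length that the decoding cost depends on.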


To train the model, prior work [gupta2020layout, arroyo2021variational] estimates the joint likelihood of observing a layout as:

$$p(l) = \prod_{i=1}^{|l|} p(l_i \mid l_{1:i-1}) \qquad (1)$$

During training, an autoregressive Transformer model is learned to maximize the likelihood using ground-truth attributes as input (i.e. teacher forcing). At inference time, the transformer model predicts the attributes sequentially based on previously generated output (i.e. autoregressive decoding), starting from the begin-of-sequence token $\langle\text{bos}\rangle$ until yielding the end-of-sequence token $\langle\text{eos}\rangle$. The generation must follow a fixed conditional dependency. For example, Eq. (1) defines an immutable generation order $c \to x \to y \to w \to h$. In order to generate the height $h$ of an object, one must know its $x$-$y$ coordinates and width $w$.

There are two issues with autoregressive decoding for conditional generation. First, it is infeasible to process user conditions that differ from the dependency order used in training. For instance, the model using Eq. (1) is not able to generate $x$-$y$ coordinates from width and height, i.e., $p(x, y \mid w, h)$, which corresponds to the practical example of placing an object with a given size. This issue is exacerbated by complex layouts that require more attributes to represent an object. Second, autoregressive inference is not parallelizable, rendering it inefficient for dense layouts with a large number of objects or attributes.

4 Approach

Our goal is to design a transformer model for controllable layout generation. We propose a novel method to learn a non-autoregressive transformer. Unlike existing layout transformers [gupta2020layout, arroyo2021variational], the new layout transformer is bidirectional and can generate all attributes simultaneously in parallel, which allows not only for flexible conditional generation but also for more efficient inference. In this section, we first discuss the model and training objective; then we detail a new parallel decoding algorithm based on iterative refinement.

4.1 Model and Training

The BLT backbone is the multi-layer bidirectional Transformer encoder [vaswani2017attention] as shown in Fig. 2. We use the identical architecture as in the existing autoregressive layout transformers [arroyo2021variational, gupta2020layout] but the attention mechanism in our model is bidirectional in the sense that it can utilize richer contexts in two directions to predict the attribute.
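To make the difference between the two attention patterns concrete, the sketch below builds both visibility masks for a single object with five attribute tokens. It is only an illustration of the general idea of causal versus full self-attention; the function names and the boolean-matrix representation are assumptions, not details taken from the paper.

```python
from typing import List

def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j (j <= i only)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n: int) -> List[List[bool]]:
    """Every token may attend to every other token."""
    return [[True] * n for _ in range(n)]

# Token order for one object: category, x, y, width, height.
X_INDEX, WIDTH_INDEX = 1, 3
can_x_see_width_causal = causal_mask(5)[X_INDEX][WIDTH_INDEX]
can_x_see_width_bidirectional = bidirectional_mask(5)[X_INDEX][WIDTH_INDEX]
```

Under the causal mask the x token cannot see the width token that follows it, which is why a known width cannot inform the position; under the bidirectional mask it can.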

(a) BLT Training Phase.
(b) BLT Iterative Decoding Process.
Figure 2: The training (left) and decoding stage (right) of the proposed Bidirectional Layout Transformer (BLT).

Inspired by BERT [devlin2019bert], during training, we randomly select a subset of attributes in the input sequence, replace them with a special “[mask]” token, and optimize the model to predict the masked attributes. For a layout sequence $l$, let $M$ denote a set of masked positions. Replacing the attributes in $l$ with “[mask]” at the positions in $M$ yields the masked sequence $l^{M}$.

Given a layout set $\mathcal{D}$, the training objective is to minimize the negative log-likelihood of the masked attributes:

$$\mathcal{L}_{\text{mask}} = -\mathbb{E}_{l \in \mathcal{D}} \Big[ \sum_{i \in M} \log p(l_i \mid l^{M}) \Big] \qquad (2)$$

While we can technically predict all attributes, we do not compute the loss for the unmasked attributes, since copying a visible input does not constitute a learning task.
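The objective can be illustrated with a small pure-Python sketch that scores only the masked positions. The function names are hypothetical, and averaging over the masked positions (rather than summing) is an assumption made here for readability; it only changes the loss by a constant factor per sequence.

```python
import math
from typing import List, Sequence

def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def masked_nll(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    masked_positions: Sequence[int],
) -> float:
    """Mean negative log-likelihood of the true tokens at masked positions.

    Unmasked positions are ignored, mirroring the statement that no loss
    is computed for attributes the model can simply copy from its input.
    """
    if not masked_positions:
        raise ValueError("at least one position must be masked")
    total = 0.0
    for i in masked_positions:
        total -= log_softmax(logits[i])[targets[i]]
    return total / len(masked_positions)

# Three positions, vocabulary of four tokens; positions 0 and 2 are masked.
toy_logits = [[0.0] * 4, [5.0, 0.0, 0.0, 0.0], [0.0] * 4]
toy_loss = masked_nll(toy_logits, targets=[1, 0, 3], masked_positions=[0, 2])
```

With uniform logits at the masked positions the loss equals log 4, and the confident prediction at the unmasked position 1 has no effect on it.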

The masking strategy greatly affects the quality of a masked language model [devlin2019bert]. BERT [devlin2019bert] applies random masking with a fixed ratio, where a constant 15% of the tokens are randomly masked for each input. Similarly, we find that the masking strategy is important for layout generation, but the random masking used in BERT does not work well. We propose to use a hierarchical sampling policy. To do so, we first divide the attributes of an object into several semantic groups, e.g. Fig. 2 shows 3 groups: category, position, and size. In the first step, we randomly select a semantic group. Next, we dynamically sample the number of masked tokens from a uniform distribution between one and the number of attributes belonging to the chosen group, and then randomly mask that number of tokens in the selected group. As such, it is guaranteed that the model only predicts attributes of the same semantic meaning each time.
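The hierarchical sampling policy can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the `MASK` id, the per-object attribute order (category, x, y, w, h), the omission of special tokens, and the choice to count the group's attributes over all objects in the layout are assumptions.

```python
import random
from typing import List, Sequence, Tuple

MASK = -1  # hypothetical id for the "[mask]" token
ATTRS_PER_OBJECT = 5
# Offsets of each semantic group inside one object (category, x, y, w, h).
GROUPS = {"category": (0,), "position": (1, 2), "size": (3, 4)}

def hierarchical_mask(
    tokens: Sequence[int], rng: random.Random
) -> Tuple[List[int], List[int], str]:
    """Mask a random number of attributes from one randomly chosen group."""
    if not tokens or len(tokens) % ATTRS_PER_OBJECT != 0:
        raise ValueError("expected a non-empty sequence of 5-attribute objects")
    # Step 1: pick a semantic group.
    group = rng.choice(sorted(GROUPS))
    offsets = GROUPS[group]
    candidates = [i for i in range(len(tokens)) if i % ATTRS_PER_OBJECT in offsets]
    # Step 2: sample how many tokens to mask, uniformly in [1, group size].
    num_masked = rng.randint(1, len(candidates))
    # Step 3: mask that many tokens, all from the chosen group.
    masked_positions = sorted(rng.sample(candidates, num_masked))
    masked = list(tokens)
    for i in masked_positions:
        masked[i] = MASK
    return masked, masked_positions, group

# Two objects, ten attribute tokens.
masked_seq, positions, chosen_group = hierarchical_mask(list(range(10)), random.Random(0))
```

Because all masked positions come from a single group, each training example asks the model to predict only categories, only positions, or only sizes.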

4.2 Parallel Decoding by Iterative Refinement

In BLT, all attributes in the layout are generated simultaneously in parallel. Yet, generating layouts in a single pass is quite challenging. To tackle this, we introduce a new parallel decoding algorithm based on iterative attribute refinement. The core idea is to generate a layout in a small constant number of steps, with parallel decoding applied at each step.

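Input: Sequence $l^{M}$ with partially-known attributes. Constant $T$ for the number of iterations.
1: for $g$ in [$C$, $S$, $P$] do    ▷ Loop over semantic groups
2:     for $i \leftarrow 1$ to $T/3$ do
3:         Predict all unknown attributes in parallel, obtaining predictions and prediction scores
4:         Compute mask ratio
5:         $n_g$: # attributes in $g$
6:         Get mask indices $M$ (attributes in $g$ with the lowest prediction scores)
7:         Obtain $l^{M}$ by masking with respect to $M$
8:     end for
9: end for
Algorithm 1 Decoding by Iterative Attribute Refinement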

Algorithm 1 presents the proposed decoding algorithm. The procedure is also illustrated in Fig. 2(b). The input to the decoding algorithm is a mixed sequence of known and unknown attributes, where the known attributes are given by the user inputs and the model aims at generating the unknown attributes, which are denoted by the [mask] token.

As in training, we divide the object attributes into three semantic groups: category ($C$), size ($S$) and position ($P$). In each iteration, only one group of attributes is generated. In Step 3 of Algorithm 1, the model makes parallel predictions for all unknown attributes, together with their prediction scores. In Step 6, it selects the attributes that belong to the chosen group and have the lowest prediction scores. Finally, it masks these low-confidence attributes, on which the model has doubts. These masked attributes correspond to the difficult cases that will be re-predicted in the next iteration of refinement, conditioned on all other attributes ascertained so far. The masking ratio calculated in Step 4 decreases with the number of iterations. This process repeats $T$ times until all attributes of all objects are generated (cf. Fig. 2(b)). Note that the attributes given by the user are considered ground truth and hence are not masked during generation. Our algorithm is inspired by the bidirectional NMT model [ghazvininejad2019mask]. However, our algorithm is novel in layout generation for processing attribute groups.
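The refinement loop described above can be sketched in Python with a stand-in model. Several details here are assumptions rather than facts from the paper: the linearly decaying mask ratio, the equal number of iterations per group, keeping other groups masked until their turn, the `MASK` id, and the `model` callable that returns one prediction and one confidence score per position. Special tokens are omitted for brevity.

```python
import math
from typing import Callable, List, Sequence, Tuple

MASK = -1  # hypothetical id for the "[mask]" token
ATTRS_PER_OBJECT = 5  # category, x, y, w, h
# Decoding order over semantic groups: category, then size, then position.
GROUP_ORDER = [("category", (0,)), ("size", (3, 4)), ("position", (1, 2))]

Model = Callable[[Sequence[int]], Tuple[List[int], List[float]]]

def iterative_decode(tokens: Sequence[int], model: Model, iters_per_group: int = 3) -> List[int]:
    seq = list(tokens)
    # Attributes supplied by the user are treated as fixed and never re-masked.
    fixed = {i for i, t in enumerate(seq) if t != MASK}
    for _name, offsets in GROUP_ORDER:
        positions = [
            i for i in range(len(seq))
            if i % ATTRS_PER_OBJECT in offsets and i not in fixed
        ]
        confidence = {}
        for step in range(1, iters_per_group + 1):
            # One parallel forward pass predicts every position at once.
            predictions, scores = model(seq)
            for i in positions:
                if seq[i] == MASK:
                    seq[i] = predictions[i]
                    confidence[i] = scores[i]
            # Assumed schedule: the mask ratio decays linearly to zero.
            ratio = (iters_per_group - step) / iters_per_group
            num_to_mask = math.floor(ratio * len(positions))
            # Re-mask the least confident attributes of the current group.
            for i in sorted(positions, key=lambda p: confidence[p])[:num_to_mask]:
                seq[i] = MASK
    return seq

def toy_model(seq: Sequence[int]) -> Tuple[List[int], List[float]]:
    """Stand-in for a trained network: deterministic predictions and scores."""
    n = len(seq)
    return [100 + i for i in range(n)], [((7 * i) % 5) / 5.0 for i in range(n)]

# Object 1: category and size known; object 2: only the category known.
partial = [3, MASK, MASK, 10, 12, 4, MASK, MASK, MASK, MASK]
decoded = iterative_decode(partial, toy_model)
```

The number of forward passes is three times `iters_per_group` regardless of how many objects the layout contains, which is the property the complexity argument relies on.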

Algorithm 1 can also be used for unconditional generation. In this case, the input is a layout sequence of only “[mask]” tokens, and the same algorithm is used to generate all attributes in the layout. Unlike conditional generation, we need to know the sequence length in advance, i.e. the number of objects to be generated. Here, we can simply use the prior distribution obtained on the training dataset. During decoding, we obtain the number of objects through sampling from this prior distribution.
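Sampling the number of objects from the training prior can be sketched as below. The helper names are hypothetical, and the empirical histogram over object counts is one straightforward way to realize "the prior distribution obtained on the training dataset"; the paper does not spell out the estimator.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence

def fit_length_prior(train_object_counts: Sequence[int]) -> Dict[int, float]:
    """Empirical distribution over the number of objects per training layout."""
    counts = Counter(train_object_counts)
    total = sum(counts.values())
    return {k: v / total for k, v in sorted(counts.items())}

def sample_num_objects(prior: Dict[int, float], rng: random.Random) -> int:
    """Draw an object count according to the fitted prior."""
    lengths = list(prior)
    weights = [prior[k] for k in lengths]
    return rng.choices(lengths, weights=weights, k=1)[0]

def all_mask_sequence(num_objects: int, mask_token: int = -1, attrs_per_object: int = 5) -> List[int]:
    """Input for unconditional decoding: every attribute is unknown."""
    return [mask_token] * (num_objects * attrs_per_object)

prior = fit_length_prior([2, 2, 3, 5])
num_objects = sample_num_objects(prior, random.Random(0))
start_sequence = all_mask_sequence(num_objects)
```

The resulting all-mask sequence is then passed to the same decoding procedure that is used for conditional generation.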

4.3 Complexity Analysis

This section compares the complexity of the proposed method and the autoregressive layout transformers [arroyo2021variational, gupta2020layout]. We focus on the time complexity when full parallelization is assumed. To generate a sequence of length $N$, the autoregressive transformer needs $N$ steps with a cost of $O(1)$ for generating each attribute, so the total cost amounts to $O(N)$. By contrast, our model generates the layout in a constant number of steps $T$, where the total cost equals $O(T)$. Besides, both models have the same space complexity of $O(N^2)$. The above theoretical analysis shows the constant time complexity of BLT, and the finding is consistent with our empirical run-time comparison in Section 5.4.

5 Experimental Results

RICO

| Model | IOU | Overlap | Alignment | Sim. (Category) | Sim. (Category + Size) |
|---|---|---|---|---|---|
| Trans. | 0.24 | 0.33 | 0.30 | 0.20 | - |
| VTN | 0.22 | 0.30 | 0.32 | 0.20 | - |
| Ours | 0.22 | 0.23 | 0.20 | 0.21 | 0.30 |

PubLayNet

| Model | IOU | Overlap | Alignment | Sim. (Category) | Sim. (Category + Size) |
|---|---|---|---|---|---|
| Trans. | 0.09 | 0.06 | 0.33 | 0.11 | - |
| VTN | 0.10 | 0.06 | 0.33 | 0.10 | - |
| Ours | 0.09 | 0.04 | 0.25 | 0.11 | 0.18 |

COCO

| Model | IOU | Overlap | Alignment | Sim. (Category) | Sim. (Category + Size) |
|---|---|---|---|---|---|
| Trans. | 0.60 | 1.66 | 0.34 | 0.20 | - |
| VTN | 0.63 | 1.79 | 0.32 | 0.22 | - |
| Ours | 0.35 | 1.93 | 0.16 | 0.24 | 0.44 |

Magazine

| Model | IOU | Overlap | Alignment | Sim. (Category) | Sim. (Category + Size) |
|---|---|---|---|---|---|
| Trans. | 0.20 | 0.22 | 0.48 | 0.15 | - |
| VTN | 0.18 | 0.15 | 0.47 | 0.15 | - |
| Ours | 0.18 | 0.12 | 0.44 | 0.18 | 0.27 |

Table 1: Category (+ Size) conditional layout generation performance on various benchmarks.

Ads

| Model | IOU | Overlap | Alignment | Sim. (Category) | Sim. (Category + Size) |
|---|---|---|---|---|---|
| Trans. | 0.19 | 0.15 | 0.35 | 0.30 | - |
| VTN | 0.18 | 0.15 | 0.33 | 0.30 | - |
| Ours | 0.10 | 0.10 | 0.18 | 0.31 | 0.41 |

Table 2: Conditional generation performance on Image Ads.

This section verifies the proposed method on six layout benchmarks under four metrics to examine the quality. We compare to the state-of-the-art models on conditional and unconditional layout generation tasks. The results show our model performs favorably against the strong baselines and achieves 4x-10x speedups in layout generation.

| Condition (Sim.) | Trans. | VTN | Ours |
|---|---|---|---|
| Category | 0.04 | 0.04 | 0.06 |
| Category + Size | - | - | 0.10 |

Table 3: Conditional generation performance on 3D-FRONT.

| | Benchmark 1 | | | Benchmark 2 | | | Benchmark 3 | | |
| Methods | IOU | Overlap | Alignment | IOU | Overlap | Alignment | IOU | Overlap | Alignment |
|---|---|---|---|---|---|---|---|---|---|
| LayoutVAE [jyothi2019layoutvae] | 0.193 | 0.400 | 0.416 | 0.171 | 0.321 | 0.472 | 0.325 | 2.819 | 0.246 |
| Trans. [gupta2020layout] | 0.086 | 0.145 | 0.366 | 0.039 | 0.006 | 0.361 | 0.194 | 1.709 | 0.334 |
| VTN [arroyo2021variational] | 0.115 | 0.165 | 0.373 | 0.031 | 0.017 | 0.347 | 0.197 | 2.384 | 0.330 |
| Ours | 0.127 | 0.102 | 0.342 | 0.048 | 0.012 | 0.337 | 0.227 | 1.452 | 0.311 |

Table 4: Unconditional layout generation comparison to the state-of-the-art on three benchmarks. Results of baselines are cited from [arroyo2021variational] and our scores are calculated following the same method described in [arroyo2021variational].

5.1 Setups


We employ six datasets that cover representative graphic design applications. RICO [deka2017rico] is a dataset of user interface designs for mobile applications. It contains 91K entries with 27 object categories (button, toolbar, list item, etc.). PubLayNet [zhong2019publaynet] contains 330K examples of machine-annotated scientific documents crawled from the Internet. Its objects come from 5 categories: text, title, figure, list, and table. Magazine [zheng2019content] contains 4K images of magazine pages and 6 categories (texts, images, headlines, over-image texts, over-image headlines, backgrounds). Image Ads [lee2020neural] is a commercial ads dataset with the layout annotation detailed in [lee2020neural]. COCO [lin2014microsoft] contains 100K images of natural scenes. We follow [arroyo2021variational] and use the Stuff variant, which contains 80 things and 91 stuff categories, after removing small bounding boxes (about 2% of the image area or less) and instances tagged as “iscrowd”. 3D-FRONT [fu20203dfront] is a repository of professionally designed indoor layouts. It contains around 7K room layouts with objects belonging to 37 categories, e.g., table and bed. Different from the previous datasets, objects in 3D-FRONT are represented by 3D bounding boxes.

Evaluation metrics

Prior works use multiple metrics to assess the quality of generated layouts from various perspectives. We employ three common metrics in the literature and introduce a new metric for conditional generation. Specifically, IOU measures the intersection over union between the generated bounding boxes. We compute the IOU in the pixel space. Overlap [li2019layoutgan] measures the total overlapping area between any pair of bounding boxes inside the layout. Alignment [lee2020neural] computes an alignment loss with the intuition that objects in graphic design are often aligned either by center or edge (e.g., left- or right-aligned). For all IOU, Overlap, Alignment, the lower the better.
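The geometric idea behind IOU and Overlap can be sketched with axis-aligned boxes in (center x, center y, width, height) format. This is a simplified analytic illustration: the paper computes IOU in pixel space, and the exact normalization of Overlap follows the cited work, so the averaging and summation used below are assumptions, as are the function names.

```python
from itertools import combinations
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (center x, center y, width, height)

def to_corners(box: Box) -> Tuple[float, float, float, float]:
    x, y, w, h = box
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2

def intersection_area(a: Box, b: Box) -> float:
    ax0, ay0, ax1, ay1 = to_corners(a)
    bx0, by0, bx1, by1 = to_corners(b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    return inter_w * inter_h

def pairwise_iou(a: Box, b: Box) -> float:
    inter = intersection_area(a, b)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def mean_pairwise_iou(boxes: Sequence[Box]) -> float:
    """Average IoU over all unordered pairs of boxes (0.0 for fewer than two boxes)."""
    pairs = list(combinations(boxes, 2))
    return sum(pairwise_iou(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def total_overlap(boxes: Sequence[Box]) -> float:
    """Sum of intersection areas over all unordered pairs of boxes."""
    return sum(intersection_area(a, b) for a, b in combinations(boxes, 2))

layout = [(1.0, 1.0, 2.0, 2.0), (2.0, 1.0, 2.0, 2.0), (10.0, 10.0, 1.0, 1.0)]
layout_iou = mean_pairwise_iou(layout)
layout_overlap = total_overlap(layout)
```

In both cases a lower value means the generated objects collide less, which matches the "lower is better" convention stated above.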

The above metrics do not consider user inputs, and we need a metric for conditional generation. We introduce Similarity which compares the generated layout with the real layout under the same input condition. We use DocSim [zheng2019content] to calculate the similarity between two layouts.

Generation settings

We examine three practical layout generation scenarios (2 conditional and 1 unconditional).

  • Conditional on Category: only object categories are given by users. The model needs to predict the size and position for each object.

  • Conditional on Category + Size: users specify the object category and size for each object, and the model needs to predict the position, i.e. placing the objects on the canvas.

  • Unconditional Generation: no information is provided by users. Users have little control of the generation process. Prior works mainly focus on this setting.

In unconditional generation, the model generates 1K samples from the random seed. The test split of each dataset is used for conditional generation.

Implementation details

The model is trained for five trials with random initialization, and the averaged metrics with standard deviations are reported. All models including ours have the same configuration, i.e., 4 layers, 8 attention heads, 512 embedding dimensions and 2,048 hidden dimensions. The Adam optimizer [kingma2014adam] is used. Models are trained on 22 TPU devices with batch size 64. For conditional generation, we randomly shuffle objects in the layout. For unconditional generation, to improve diversity, we use nucleus sampling [holtzman2019curious] for the baseline Transformers and top-k sampling for our model. Greedy decoding is used for conditional generation. Please refer to the supplementary material for more detailed hyper-parameter configurations.
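For readers unfamiliar with top-k sampling, the sketch below shows the general technique on a single probability vector. It is a generic illustration, not the paper's sampling code: the function name is hypothetical and the value of k is a placeholder, since the value used in the experiments is not given in this text.

```python
import random
from typing import Sequence

def top_k_sample(probs: Sequence[float], k: int, rng: random.Random) -> int:
    """Sample a token index from the k most probable entries of `probs`.

    The retained probabilities are renormalized implicitly by the weighted draw.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

token_probs = [0.1, 0.6, 0.3]
greedy_like = top_k_sample(token_probs, k=1, rng=random.Random(0))
diverse = top_k_sample(token_probs, k=2, rng=random.Random(0))
```

Setting k to 1 recovers greedy decoding, which is the choice reported above for conditional generation; larger k trades some fidelity for diversity.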

5.2 Quantitative Comparison

Conditional generation The results are shown in Tables 1 and 2. We compare against the state-of-the-art layout models, i.e. LayoutTransformer (Trans.) [gupta2020layout] and Variational Transformer Network (VTN) [arroyo2021variational]. Two conditional generation tasks are examined, i.e. conditional on category (Columns 2-5) and on category + size (Column 6). The same model is used for both conditional cases, and the baseline models are not able to perform the category + size case. The results are obtained on five layout benchmarks, where the mean (over 5 trials) and standard deviation are reported.

Thanks to the flexible bidirectional self-attention, our model is able to conduct conditional generation on category + size, while the baseline models fail. For conditional generation on category, our model achieves the best result on the important Similarity metric (Columns 5 and 6), suggesting that our model generates layouts that better meet user needs. On the other metrics, it yields reasonable results compared to the strong baselines. We note the worse Overlap metric on the COCO dataset, which arises because object bounding boxes do overlap in natural scene images.

We also extend our comparison to 3D layout generation on 3D-FRONT [fu20203dfront] (cf. Fig. 5). The result in Table 3 also verifies the advantage of our method.

| Model | Magazine | RICO |
|---|---|---|
| NDN-none (unconditional) | 2.51 ± 0.09 | 0.91 ± 0.03 |
| NDN-all (conditional) | 1.51 ± 0.09 | 0.32 ± 0.02 |
| Ours (conditional) | 0.45 ± 0.03 | 0.25 ± 0.11 |

Table 5: Comparison to the state-of-the-art conditional VAE [lee2020neural] using the metric Alignment on RICO and Magazine.

Unconditional Generation Although our model is not designed for such purpose, we compare it to recent models [jyothi2019layoutvae, gupta2020layout, arroyo2021variational] on unconditional layout generation. From Table 4, our model outperforms LayoutVAE [jyothi2019layoutvae] and achieves comparable performance with two autoregressive transformers (Trans. [gupta2020layout] and VTN [arroyo2021variational]). Furthermore, we compare our method with a state-of-the-art VAE model (called NDN) [lee2020neural] by their proposed metric “Alignment”. We demonstrate better results when the model is given no constraints (NDN-none) and all constraints (NDN-all).

(a) head 0-2
(b) head 1-3
(c) head 2-4
(d) head 3-2
Figure 3: Examples of attention heads exhibiting the patterns for masked tokens. The darkness of a line indicates the strength of the attention weight (some attention weights are so low they are invisible). We use layer-head number to denote a particular attention head.
Figure 4: Conditional layout generation for scientific papers, user interface, and magazine. The user inputs are the object category and their size (width, height). We also present the rendered examples constructed based on the generated layouts.
Figure 5: 3D-FRONT sample layouts.

5.3 Qualitative Result

We show some generated layouts, along with rendered examples for visualization, in Fig. 4. The setting is conditional generation on category and size for three design applications: mobile user interfaces, scientific papers, and magazines. We observe that our method yields realistic layouts that can be rendered into high-quality output, suggesting that our model is capable of capturing positional relationships between objects.

Next, we explore the home design task on the 3D-FRONT dataset [fu20203dfront]. The goal is to place furniture with user-given category, length, height, and width. Examples are shown in Fig. 5. Unlike the previous tasks, here the model needs to predict the position of a 3D bounding box. The low similarity score on this dataset indicates that housing design layout is still a challenging task, which we will explore in future work.

Where to pay attention?

To further understand what relationships between attributes BLT is able to learn, we visualize how our model’s attention heads behave. We choose a simple layout with two objects and mask their positions $(x, y)$. The model needs to predict these masked attributes from the other known attributes. Examples of heads exhibiting these patterns are shown in Fig. 3, where a layer-head number denotes a particular attention head. At head 0-2, the masked $y$-coordinate of the second object attends to its category (c2) and especially its height (h2), which is reasonable because the $y$-coordinate is highly relevant to the height of the object. Furthermore, at heads 2-4 and 3-2, a masked position token attends to the width of not only the first but also the second object. Given this contextual information from other objects, the model is able to predict the positions of these objects more accurately. A similar pattern is also found at head 3-2 for another masked position token.

5.4 Ablation Study

Attribute order in the decoding algorithm

In Algorithm 1, we prespecify an order of attribute groups, which is Category (C) → Size (S) → Position (P). Here, more orders are explored, including the “No Order” setting of BERT [devlin2019bert], where attributes are individually selected without using groups. The result is shown in Table 6. “No Order” performs clearly worse than the other orders, demonstrating the necessity of our group-wise algorithm. Moreover, among the pre-defined orders, it is better to generate the category first and determine the location and size afterward.

Decoding speed w.r.t. the sequence length

Figure 6: Decoding speed versus number of generated assets. ‘Autoregressive’ denotes the autoregressive Transformer-based model. ‘Iter-*’ shows the proposed model with various numbers of iterations.

We compare the inference speed of our model and the autoregressive transformer models [gupta2020layout, arroyo2021variational]. Specifically, all models generate 1,000 layouts with batch size 1 on a single GPU, and the average decoding time in milliseconds is reported. The result is shown in Fig. 6, where the $x$-axis denotes the number of objects in the layout. Autoregressive decoding time grows with the number of objects, whereas the decoding time of the proposed model appears unaffected by it. The speed advantage becomes more evident when producing dense layouts. For example, our fastest model obtains a 4x speed-up when generating around 10 objects and a 10x speed-up for 20 objects.

Iterative Refinement Process

To understand the process of our iterative refinement algorithm, we show a sample of layouts generated at different numbers of iterations in Fig. 8. At the first iteration, there are severe overlaps between objects, showing how difficult it is to yield high-quality layouts with just one pass. However, after iteratively refining low-confidence attributes, the layouts become more realistic.

Quantitatively, the IOU and Overlap metrics (lower is better) are plotted against the number of iterations in Fig. 7. With more iterations, the quality metrics improve and then stabilize. This result on the Magazine and PubLayNet datasets is consistent with our qualitative observation above.

Figure 7: IOU and overlap scores with different numbers of iterations on Magazine and PubLayNet.
Figure 8: Layout refinement process. Layouts generated at different iterations are shown on three datasets.
| Order | IoU | Overlap | Alignment |
|---|---|---|---|
| CSP | 0.127 | 0.102 | 0.342 |
| CPS | 0.129 | 0.107 | 0.344 |
| SCP | 0.147 | 0.109 | 0.351 |
| SPC | 0.162 | 0.121 | 0.357 |
| No Order | 0.208 | 0.193 | 0.374 |
Table 6: Layout generation results with different iteration group orders on the RICO dataset. C, S and P denote category, size and position attribute groups, respectively.

6 Conclusion and Future Work

We present BLT, a bidirectional layout transformer capable of generating layout attributes in parallel. We design an iterative refinement algorithm to generate high-quality layouts in a few rounds. Compared to the autoregressive Transformer, our model is suitable for flexible conditional layout generation where every attribute is controlled by the users. Experimental results on several benchmarks demonstrate the effectiveness and flexibility of BLT.

Negative Societal Impact and Limitation BLT may be applied to questionable and potentially harmful applications. A limitation of our work is content-agnostic conditional generation. We leave this out to have a fair comparison to our baselines which do not use visual information either. In the future, we will explore using rich visual information.


Appendix: BLT: Bidirectional Layout Transformer for Controllable Layout Generation

Appendix A Training Details

To find the optimal hyperparameters for each task, we use a simple grid search over the learning rate, dropout and attention dropout. The data preprocessing procedure discussed in [arroyo2021variational] is used. In particular, for the RICO dataset, [arroyo2021variational] removes layouts with more than 100 objects, ignoring 0.03% of the data. Since our model is trained on TPUs with stricter memory constraints, we omit layouts with more than 50 objects, removing around 0.5% of the data in total.

Appendix B Evaluation Metrics

In [arroyo2021variational], the authors calculate the IOU scores between all pairs of overlapped objects and average them. In our work, we compute the IOU score by projecting the layouts into the discrete space, where it is the overlapped area divided by the union area of all objects. We show the difference via a toy example in Fig. 9. The areas of objects A, B and C are 5, 1 and 1, and the overlapped area of B and C is 0.5. With their method, since only overlapped objects are considered, only the IOU of objects B and C is computed, which is 0.5 / (1 + 1 - 0.5) = 1/3. In contrast, in our method, the overlapped area of B and C is divided by the union area of all objects, so the IOU of this layout is 0.5 / (5 + 1 + 1 - 0.5) = 1/13, which is more reasonable than their result.

Figure 9: A toy layout sample for the IOU computation. The metric we use yields a more reasonable IOU than the one used in [arroyo2021variational].
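The two IOU definitions can be compared with a short sketch. It is an illustration under assumptions, not the evaluation code: the function names are hypothetical, boxes are `(x0, y0, x1, y1)` tuples on an integer grid, and the toy layout is scaled by two (areas 10, 2 and 2, with one overlapped cell) so that every area is a whole number of cells.

```python
from collections import Counter
from itertools import combinations


def cells(box):
    """Return the set of grid cells covered by a box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def pairwise_iou(boxes):
    """Average IOU over the pairs of boxes that overlap."""
    scores = []
    for a, b in combinations([cells(box) for box in boxes], 2):
        inter = len(a & b)
        if inter:
            scores.append(inter / len(a | b))
    return sum(scores) / len(scores) if scores else 0.0


def layout_iou(boxes):
    """Overlapped area divided by the union area of all boxes."""
    coverage = Counter(c for box in boxes for c in cells(box))
    if not coverage:
        return 0.0
    overlapped = sum(1 for n in coverage.values() if n >= 2)
    return overlapped / len(coverage)


# A is large and isolated; B and C overlap in exactly one cell.
boxes = [(0, 0, 5, 2), (0, 3, 2, 4), (1, 3, 3, 4)]
print(pairwise_iou(boxes))  # 1/3: only the pair (B, C) contributes
print(layout_iou(boxes))    # 1/13: the large object A enlarges the union
```

Because the pairwise variant ignores objects that overlap with nothing, the isolated object A has no effect on its score, whereas the layout-level variant accounts for it through the union area.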

Appendix C Diverse Conditional Generation

In our main experiments, we use greedy search to pick the most likely candidate for each attribute at each iteration. Here, we instead generate layouts by top-k sampling from the likelihood distribution for conditional generation on category and size. This leads to more diverse generated layouts. Some examples are shown in Fig. 11.
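As a minimal sketch of how top-k sampling differs from greedy search, the snippet below samples one attribute value from a toy distribution. The function name `top_k_sample`, the probabilities and the values of k are assumptions for illustration only.

```python
import random


def top_k_sample(probs, k, rng):
    """Sample an index from the k most likely entries of `probs`."""
    # Keep the k highest-probability indices.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # random.choices treats the weights as relative, so no renormalization is needed.
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


probs = [0.05, 0.60, 0.25, 0.10]  # toy likelihoods for one attribute
rng = random.Random(0)

greedy = top_k_sample(probs, k=1, rng=rng)  # k=1 reduces to greedy search
samples = {top_k_sample(probs, k=2, rng=rng) for _ in range(100)}
```

With `k=1` the call always returns the most likely index, while larger values of k let less likely candidates through, which is the source of the diversity.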

Appendix D Additional Visualization

D.1 More Attention Head Patterns

Patterns for other heads at different layers are shown in Fig. 10. We find that for masked positions (head 1-1, head 2-6, etc.), the heads attend to the width information of various objects for accurate prediction. Similar findings hold for other heads.

(a) head 1-1
(b) head 1-3
(c) head 1-6
(d) head 0-4
(e) head 2-6
(f) head 2-5
(g) head 3-6
(h) head 3-7
Figure 10: More examples of attention heads exhibiting the patterns for masked tokens. The darkness of a line indicates the strength of the attention weight (some attention weights are so low they are invisible). We use layer-head number to denote a particular attention head.
Figure 11: Diverse conditional generation via the top-k sampling method.
Figure 12: Conditional layout generation for scientific papers, user interfaces, and magazines. The user inputs are the object categories and their sizes (width, height). We compare the generated layout and the real layout with the same input in the dataset.

D.2 Qualitative Results

We show more samples in Fig. 12 from conditional generation on category and size for four design applications: mobile UIs, scientific papers, magazines and natural scenes.

D.3 Iterative Refinement Process

We list more samples of the iterative refinement process in Fig. 13. Severe overlaps between objects are mitigated with more iterations. The quantitative quality metrics for the layouts generated at each iteration are compared in Figure 7 of the main paper.

Figure 13: More examples of the layout refinement process. Layouts generated at different iterations are shown on three datasets.
Figure 14: Failure cases for layout generation using the proposed method. We compare the generated layout and the real layout with the same input in the dataset. See Section D.4 for more discussion.

D.4 Failure Cases

Some undesired conditional generation results are shown in Fig. 14. As with other layout generation models, some generated results contain overlaps between objects. Furthermore, some generated samples differ substantially from the real layouts and have low visual quality. For example, in the second sample on the Magazine dataset, the alignment of the generated sample is worse than that of its corresponding real layout. We will address these issues in future work.