We address the problem of layout generation for diverse domains such as images, documents, and mobile applications. A layout is a set of graphical elements, belonging to one or more categories, placed together in a meaningful way. Generating a new layout or extending an existing layout requires understanding the relationships between these graphical elements. To this end, we propose a novel framework, LayoutTransformer, that leverages a self-attention based approach to learn contextual relationships between layout elements and generate layouts in a given domain. The proposed model improves upon state-of-the-art approaches to layout generation in four ways. First, our model can generate a new layout either from an empty set or by adding more elements to a partial layout starting from an initial set of elements. Second, as the approach is attention-based, we can visualize which previous elements the model is attending to while predicting the next element, thereby providing an interpretable sequence of layout elements. Third, our model can easily scale to support both a large number of element categories and a large number of elements per layout. Finally, the model also produces an embedding for the various element categories, which can be used to explore the relationships between categories. We demonstrate with experiments that our model can produce meaningful layouts in diverse settings such as object bounding boxes in scenes (COCO bounding boxes), documents (PubLayNet), and mobile applications (the Rico dataset).
In the real world, there exist strong relationships between different objects found in the same environment [36, 34]. For example, a dining table usually has chairs around it; a surfboard is found near the sea; horses do not ride cars; etc. Biederman provided strong evidence in cognitive neuroscience that perceiving and understanding a scene involves two related processes: perception and comprehension. Perception deals with processing the visual signal or the appearance of a scene. Comprehension deals with understanding the schema of a scene, where this schema (or layout) can be characterized by contextual relationships between objects (e.g., support, occlusion, and relative likelihood, position, and size). For generative models that synthesize scenes, this evidence underpins the importance of two factors that contribute to the realism or plausibility of a generated scene: the layout, i.e., the arrangement of different objects, and their appearance (in terms of pixels). Therefore, generating a realistic scene requires both of these factors to be plausible.
Advances in generative models for image synthesis have primarily targeted the plausibility of the appearance signal by generating incredibly realistic images of objects (e.g., faces [20, 19], animals [3, 46]). In order to generate complex scenes [6, 15, 17, 26, 32, 47, 48], most methods require proxy representations of layouts to be provided as inputs (e.g., scene segmentations [16, 40], textual descriptions [26, 47, 33], scene graphs). We argue that to plausibly generate large and complex scenes without such proxies, it is necessary to understand and generate the layout of the scene, in terms of contextual relationships between the various objects present in it.
The layout of a scene, capturing what objects occupy what parts of the scene, is an incredibly rich representation. A plausible layout needs to follow prior knowledge of world regularities [2, 7]; for example, the sky is above the sea, and indoor scenes have certain furniture arrangements. Learning to generate layouts is a challenging problem due to the variability of real-world layouts. Each scene contains only a small fraction of possible objects, objects can be present in a wide range of locations, the number of objects varies for each scene, and so do the contextual relationships between objects (e.g., a person can carry a surfboard or ride a surfboard, and a person can ride a horse but not carry it). We parameterize each object or element of the layout by semantic attributes/categories, location, and size. In order to realize plausible semantic relationships, a generative model for layouts should be able to look at all existing objects and propose the placement of a new object.
We propose a sequential and iterative approach for modeling layouts that uses self-attention to look at existing layout elements. Our generative process can start from an empty set or an unordered set of elements already present in the scene, and can iteratively attend to existing elements to generate a new element. Moreover, by predicting either to stop or to generate the next element, our sequential approach can generate variable length layouts.
Our approach can be readily plugged into many scene generation frameworks (e.g., Layout2Image, GauGAN). However, layout generation is not limited to scene layouts. Several stand-alone applications require generating layouts or templates. For instance, in the UI design of mobile apps and websites [8, 28], an automated model for generating plausible layouts can significantly decrease the manual effort and cost of building such apps and websites. Finally, a model to create layouts can potentially help generate synthetic data for augmentation tasks [44, 4] or 3D scenes [5, 43, 42].
We summarize the contributions of our work as follows:
We develop a simple yet powerful auto-regressive model for generating layouts that can synthesize new layouts, complete partial layouts, and compute the likelihood of existing layouts. Our self-attention approach allows us to visualize which existing elements are important for generating the next category in the sequence.
We propose modeling the position and size of layout elements with a discrete multinomial distribution, which enables our approach to generalize across very diverse data distributions.
We present an exciting finding: encouraging a model to understand layouts results in feature representations that capture the semantic relationships between objects automatically (without explicitly using semantic embeddings, such as word2vec). This demonstrates the utility of layout generation as a proxy task for learning semantic representations.
Generative models. Deep generative models have in recent years shown great promise in faithfully learning a given data distribution and sampling from it. Approaches such as variational auto-encoders try to maximize an approximate log-likelihood of data generated from a known distribution. Auto-regressive and flow-based approaches such as PixelRNN and RealNVP can compute exact log-likelihoods but are inefficient to sample from. GANs are arguably the most popular generative models, demonstrating state-of-the-art image generation results [3, 20], but they do not allow log-likelihood computation. Most of these approaches and their variants work well when generating entire images, especially for datasets of images with a single entity or object. While these models can generate realistic objects, they often fail to capture the global semantic and geometric relations between objects that are needed to generate more complex, realistic scenes.
Scene generation. Most works in 2D or 3D scene synthesis generate a scene conditioned on a sentence [26, 47, 33], a scene graph [17, 24, 1], a layout [10, 13, 16, 41], or an existing image. These involved pipelines are often trained and evaluated end-to-end, and surprisingly little work has been done to evaluate the layout generation component itself. Given the input, some works generate a fixed layout and diverse scenes, while other works generate diverse layouts and scenes [17, 26]. Again, the focus of all these works is on the quality of the final image and not on the feasibility of the layout itself. In this work, we evaluate the layout modeling capabilities of two recent works [17, 26] that have layout generation as an intermediate step.
Layout generation. Synthesizing scene layouts is a relatively under-explored problem, mainly because generative models do not work well in practice when modeling sets of elements rather than images. LayoutGAN attempts to solve the problem by starting with the maximum number of possible elements in the scene and modifying their geometric and semantic properties. LayoutVAE starts with a label set, i.e., the categories of all the elements present in the layout, and then generates a feasible layout of the scene. Wang et al. [40, 39] generate the layout of an indoor room starting from a top-down view of the room. However, their method is specific to indoor-room data and makes assumptions about the presence of walls, roofs, etc., and hence cannot be easily extended to other datasets. Zheng et al. attempt to generate document layouts given the images, keywords, and category of the document.
Our approach, LayoutTransformer, offers several advantages over current layout generation approaches without sacrificing their benefits. Unlike [25, 49], the autoregressive nature of our model allows us to generate layouts of arbitrary lengths as well as to start from partial layouts. We observe that modeling the position and size of layout elements as discrete values (as discussed in Section 3.1) yields better performance on datasets, such as documents and app wireframes, where bounding boxes are typically axis-aligned. Finally, assumptions about various kinds of inputs limit the applicability of existing methods to diverse datasets. For example, to use scene graphs [17, 24] as input, relationships need to be redefined for each dataset; text [26, 47, 33] need not exist for documents or wireframes; and image-based inputs can work for documents but not for natural scenes. We remove many of these assumptions and simplify our layout generation pipeline to the extent that it can be used to synthesize layouts for very diverse datasets. With thorough experimentation, we show the superiority of LayoutTransformer on three diverse real-world datasets: COCO bounding boxes, PubLayNet documents, and Rico mobile app wireframes.
Layouts are distinct from scenes or images in several ways. There are strong non-local relationships between different elements in layouts. For example, the presence of a small bird in a corner of a scene changes the distribution of objects in the rest of the scene, and a figure in a document determines where the text can go. In the case of images, on the other hand, local relationships play a more dominant role, i.e., pixels close to each other are likely to have similar values. While CNN-based architectures such as VAEs and GANs are excellent at generating pixels or images, an intuitive way to model the distribution of complex scenes is to use an auto-regressive model that can look at all existing elements, near or far, to generate semantic relationships, followed by a convolutional architecture to generate the final image.
In this section, we propose LayoutTransformer, an auto-regressive self-attention network architecture to model layouts. It allows us to learn non-local semantic relationships between layout elements and also gives us the flexibility to work with variable-length layouts. We first describe the problem setup and then detail the network architecture and training.
We represent each layout as a sequence of the graphical elements comprising it. For two-dimensional datasets such as documents, images, and wireframes, each graphical element can be represented by a bounding box with category $c$, location $(x, y)$, and size $(w, h)$. The goal of the layout model is to learn the joint distribution of the category, location, and size of layout elements, each represented by the tuple $(c, x, y, w, h)$. We order all the elements in raster scan order, i.e., first by the $y$ coordinate and then by the $x$ coordinate. After reordering, we concatenate all the graphical elements into a flat sequence and append two special symbols, $\langle\mathrm{bos}\rangle$ and $\langle\mathrm{eos}\rangle$, to denote the start and end of the sequence. Hence, a layout of $n$ graphical elements is represented as the sequence
$$s = (\langle\mathrm{bos}\rangle, c_1, x_1, y_1, w_1, h_1, \ldots, c_n, x_n, y_n, w_n, h_n, \langle\mathrm{eos}\rangle).$$
For simplicity, we use $\theta_i$ to denote the $i$-th element of this flattened sequence of tuples, i.e., $s = (\theta_1, \theta_2, \ldots)$. We pose the problem of modeling this joint distribution as a product of conditional distributions using the chain rule:
$$p(s) = \prod_{i} p(\theta_i \mid \theta_1, \ldots, \theta_{i-1}).$$
Representation of layout elements. As introduced earlier, each layout element is represented by its category and the bounding box enclosing it. Instead of treating the location and size of bounding boxes as continuous variables, we model them with a discrete distribution, where each of $c$, $x$, $y$, $w$, and $h$ is obtained as the output of a softmax layer. Apart from allowing us to represent each element of the tuple in a simple and consistent manner, this strategy has the additional advantage that it does not assume a prior on the position and size of bounding boxes and lets the network model them arbitrarily. If we divide the layout into an $N \times N$ grid of cells, each of $x$, $y$, $w$, and $h$ takes one of $N$ discrete values, and the resulting vocabulary over which the softmax is computed has size $|V|$.
In Fig. 5, we show that discretizing position and size in this way is particularly advantageous for datasets such as document layouts, where bounding boxes are typically aligned with each other. We also discuss a variation of this strategy in the ablation studies.
Ordering of elements. The sequence of elements is important in order to train our model. We use a simple raster scan order of layout elements in our case. We show the impact of removing this ordering strategy and using an arbitrary order in ablation studies.
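To make the representation concrete, the sketch below shows one way a layout could be quantized, raster-scan ordered, and flattened into a token sequence. It is a minimal illustration of the scheme described above, not the authors' released code; the grid size, category count, and vocabulary layout (special tokens first, then categories, then a shared geometry vocabulary) are our own assumptions.

```python
# Minimal sketch: quantize, raster-scan order, and flatten a layout.
# Grid size, category count, and the token/vocabulary layout are assumptions.
import numpy as np

N = 32                       # number of grid cells per axis (assumed)
NUM_CATEGORIES = 80          # e.g., COCO thing + stuff classes (assumed)
PAD, BOS, EOS = 0, 1, 2      # special tokens (assumed ids)
CAT_OFFSET = 3               # category tokens follow the special tokens
GEOM_OFFSET = CAT_OFFSET + NUM_CATEGORIES  # x/y/w/h tokens follow the categories

def quantize(v, n_bins=N):
    """Map a normalized coordinate in [0, 1] to a discrete bin index."""
    return int(min(n_bins - 1, max(0, round(v * (n_bins - 1)))))

def flatten_layout(elements):
    """elements: list of (category_id, x, y, w, h), coordinates in [0, 1]."""
    # Raster-scan order: sort by the y coordinate, then by x.
    elements = sorted(elements, key=lambda e: (e[2], e[1]))
    seq = [BOS]
    for cat, x, y, w, h in elements:
        seq.append(CAT_OFFSET + cat)
        seq.extend(GEOM_OFFSET + quantize(v) for v in (x, y, w, h))
    seq.append(EOS)
    return np.array(seq, dtype=np.int64)

# Example: two boxes; the element higher up in the image comes first.
print(flatten_layout([(3, 0.6, 0.7, 0.2, 0.2), (0, 0.1, 0.05, 0.3, 0.4)]))
```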
We use a Transformer decoder as our auto-regressive model; Figure 2 shows the network architecture for a toy example. Each layout element $\theta_i$ is mapped to a $d$-dimensional embedding $e_i$. These embeddings are passed through a stack of masked self-attention layers. The masking is done so that the model predicts the probability of an element using only the embeddings of the previous layout elements, i.e.,
$$p(\theta_i \mid \theta_1, \ldots, \theta_{i-1}) = f(e_1, \ldots, e_{i-1}),$$
where $f$ denotes the masked multi-headed self-attention Transformer decoder. It includes a softmax layer at the end to normalize the output values between 0 and 1. Instead of using a standard cross-entropy loss, we follow the approach commonly used in Transformer-like architectures: for every ground-truth element in the sequence, we create a pseudo ground-truth distribution using label smoothing, with high confidence at the ground-truth index and the rest of the probability mass distributed uniformly over the remaining indices. We then minimize the KL-divergence between the model's predictions and this distribution. Label smoothing adversely impacts the perplexity of the model but prevents it from becoming overconfident.
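The sketch below illustrates, under our own assumptions about tensor shapes and the padding token (it is not the released implementation), the two ingredients described above: a causal mask that restricts attention to previous elements, and the label-smoothed KL-divergence objective.

```python
# Minimal PyTorch sketch of the causal mask and the label-smoothed KL loss.
import torch
import torch.nn.functional as F

def causal_mask(seq_len):
    # Boolean mask where True marks positions a token may NOT attend to,
    # so position i only sees positions < i (PyTorch attn_mask convention).
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def label_smoothed_kl(logits, targets, epsilon=0.1, pad_token=0):
    """logits: (seq_len, vocab); targets: (seq_len,) ground-truth token ids."""
    vocab = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Pseudo ground-truth: high confidence at the true index, the remaining
    # epsilon mass spread uniformly over the other tokens.
    smooth = torch.full_like(log_probs, epsilon / (vocab - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - epsilon)
    loss = F.kl_div(log_probs, smooth, reduction="none").sum(-1)
    mask = targets.ne(pad_token)            # ignore padded positions (assumed id 0)
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```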
In our base model, we fix the embedding dimension, the number of self-attention layers, and the number of attention heads in the decoder, and use a fixed label-smoothing parameter $\epsilon$. We observe that our model is quite robust to these choices, as we show in the ablation studies. The remaining details of the network architecture and training are in the supplementary material.
At training and validation time, we use teacher forcing, i.e., since we know all the sequences, we use ground-truth sequences of variable lengths to train our model efficiently. To sample a new layout, we can start either with just the start-of-sequence token or with a partial sequence of tokens. A naïve decoding strategy is greedy decoding: we predict the next element with the highest probability, append it to the existing sequence, and repeat until we reach the end-of-sequence token. A better alternative is beam search, i.e., keeping track of the most likely sequences while decoding in order to generate multiple possible layouts starting from the same initial sequence.
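As a concrete illustration of the greedy strategy (beam search additionally keeps the best few prefixes at every step), the sketch below assumes a hypothetical `model` that maps a prefix of token ids to next-token logits; it is not the authors' sampling code.

```python
# Minimal greedy-decoding sketch using the token ids assumed earlier (EOS = 2).
import torch

@torch.no_grad()
def greedy_decode(model, prefix, eos_token=2, max_len=512):
    seq = list(prefix)                      # e.g., [BOS] or a flattened partial layout
    while len(seq) < max_len:
        logits = model(torch.tensor(seq).unsqueeze(0))   # (1, len, vocab) assumed
        next_token = int(logits[0, -1].argmax())
        seq.append(next_token)
        if next_token == eos_token:
            break
    return seq
```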
In this section, we first discuss datasets used for evaluation, followed by qualitative results of our approach on these datasets. We then analyze the performance of LayoutTransformer on general and dataset-specific quantitative metrics.
We evaluate the performance of our approach on multiple diverse datasets. Specifically, we use the toy MNIST Layout dataset proposed in LayoutGAN, Rico mobile app wireframes [8, 28], COCO bounding boxes, and PubLayNet documents. Note that each dataset involves a pre-processing step, and we tried to faithfully replicate these steps as described in the original publications [25, 18]. We will release the code for pre-processing, our approach, and the experiments for reproducibility and future reference.
MNIST Layout. Following LayoutGAN, we treat pixels with values greater than a fixed threshold as foreground pixels. Each image is then represented by a set of randomly selected foreground pixel indices, with the number of indices per image bounded between a fixed minimum and maximum. The training and validation layouts are derived from the original train/val split of MNIST.
COCO bounding boxes. The COCO bounding boxes dataset is obtained from the bounding box annotations in the COCO Panoptic 2017 dataset. Following the LayoutVAE approach, we ignore images where the isCrowd flag is true. The bounding boxes come from all thing and stuff categories. Our final dataset consists of layouts derived from the COCO train and validation splits.
Rico Mobile App Wireframes. The Rico mobile app dataset [8, 28] consists of layout information for a large number of unique UI screens collected from a wide range of Android apps. Each layout consists of one or more graphical elements from categories such as text, image, and icon. A complete list of these element categories and the frequency of their appearance is provided in the supplementary material. Overall, we obtain 62,951 layouts in Rico with a median length of 36 elements.
PubLayNet. PubLayNet is a large-scale document dataset recently released by IBM. It is built from over a million documents collected from PubMed Central. The layouts are annotated with five element categories: text, title, list, table, and figure. We filter out document layouts with over 128 elements. Our final dataset consists of layouts derived from the PubLayNet train and dev splits.
We consider the following baseline approaches:
LayoutVAE. LayoutVAE uses a similar representation for layouts and consists of two separate autoregressive VAE models. Starting from a label set, which consists of the categories of elements that will be present in a generated layout, its CountVAE generates counts for each element of the label set. After that, its BoundingBoxVAE generates the location and size of each occurrence of each bounding box.
ObjGAN. ObjGAN provides an object-attention based GAN framework for text-to-image synthesis. An intermediate step in their image synthesis approach is to generate a bounding box layout given a sentence using a BiLSTM (trained independently). We adapt this step of the ObjGAN framework to our problem setup: instead of sentences, we provide the categories of all layout elements as input to ObjGAN and synthesize bounding boxes for all elements.
sg2im. Image generation from scene graphs attempts to generate complex scenes given a scene graph of the image, by first generating a layout of the scene using graph convolutions and then using that layout to generate the complete scene with GANs. The system is trained end-to-end. Since sg2im requires a scene graph as input, we create a scene graph from the input layout and reproduce the input layout using that scene graph.
Generated layout samples. Figure 3 shows generated samples of our model on different datasets. Note that our model can generate samples starting from empty or partial layouts. We demonstrate this by taking partial layouts from the validation data and completing them with greedy decoding.
Nearest neighbors. To see whether our model is memorizing the training dataset, we compute nearest neighbors of generated layouts using Chamfer distance on the top-left and bottom-right bounding box coordinates of layout elements. Figure 4 shows, for some generated layouts, their nearest neighbors in the training dataset. Our model is able to generate novel layouts that are not present in the training data.
Visualizing attention. The proposed self-attention based approach enables us to visualize which existing elements the model is attending to while it generates a new element. This is demonstrated in Figure 6.
We posited earlier that modeling layouts requires capturing contextual relationships between the various elements. We provide further evidence for this argument in three ways. First, we visualize a 2D t-SNE plot of the learned category embeddings, shown in Figure 8. We observe that super-categories from COCO are clustered together in the embedding space of the model. Certain categories such as window-blind and curtain (which belong to different super-categories) also appear close to each other. Second, Table 2 captures the most frequent bigrams and trigrams (categories that co-occur) in real and synthesized layouts. Third, Table 2 shows word2vec-style analogies captured by the embeddings learned by our model. Note that the model was trained only to generate layouts; we did not specify any additional objective for the analogical reasoning task. These observations are in line with those of Gupta et al. (A sketch of how such a visualization can be produced follows the tables below.)
| Bigrams (real) | Bigrams (synthesized) | Trigrams (real) | Trigrams (synthesized) |
| other person | other person | person other person | other person clothes |
| person other | person clothes | other person clothes | person clothes tie |
| person clothes | clothes tie | person handbag person | tree grass other |
| clothes person | grass other | person clothes person | grass other person |
| chair person | other dining table | person chair person | wall-concrete other person |
| person chair | tree grass | chair person chair | grass other cow |
| sky-other tree | wall-concrete other | person other clothes | tree other person |
| car person | person other | person backpack person | person clothes person |
| person handbag | sky-other tree | person car person | other dining table table |
| handbag person | clothes person | person skis person | person other person |
| Analogy | Nearest categories |
| snowboard : snow :: surfboard : ? | waterdrops, sea, sand |
| car : road :: train : ? | railroad, platform, gravel |
| sky-other : clouds :: playingfield : ? | net, cage, wall-panel |
| mouse : keyboard :: spoon : ? | knife, fork, oven |
| fruit : table :: flower : ? | potted plant, mirror-stuff |
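A hedged sketch of how such a t-SNE visualization of the learned category embeddings could be produced is shown below; the `token_embedding` attribute, the category token range (reused from the earlier flattening sketch), and the `category_names` list are assumptions about the implementation rather than its actual API.

```python
# Sketch: 2D t-SNE of the learned category embeddings (assumed attribute names).
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = model.token_embedding.weight.detach().cpu().numpy()    # (vocab, d), assumed
cat_vecs = emb[CAT_OFFSET:CAT_OFFSET + NUM_CATEGORIES]       # category rows only
xy = TSNE(n_components=2, perplexity=15, init="pca").fit_transform(cat_vecs)

plt.scatter(xy[:, 0], xy[:, 1], s=8)
for i, name in enumerate(category_names):                    # assumed list of names
    plt.annotate(name, xy[i], fontsize=6)
plt.show()
```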
Quantitative evaluation methods of the layout generation problem differ from dataset to dataset. In this section, we discuss some of these methods applicable to the datasets under consideration.
Downstream Task - Image generation. To evaluate the ability of layout generation approaches to produce plausible layouts for the COCO dataset, we use Layout2Im to generate an image from each layout. Table 3 shows images generated from layouts produced by our method and by LayoutVAE. We also compute the Inception Score (IS) and Fréchet Inception Distance (FID) to compare the quality and diversity of the generated images. Our method improves upon LayoutVAE on both metrics.
Negative log-likelihood. For each of the datasets, Table 5 shows the negative log-likelihood of all the layouts in the validation set. The results indicate that our approach generates more plausible layouts (more details are provided in the supplementary material).
Dataset statistics. Depending on the dataset and the definition of graphical elements, we can define statistics that layouts should follow. For Rico wireframes and PubLayNet documents, we compare two important statistics of layouts in Table 5. Overlap measures the intersection over union (IoU) between layout elements; in these datasets, elements generally do not overlap with each other, so Overlap is small. Coverage indicates the percentage of the canvas covered by the layout elements. The table shows that layouts generated by our method resemble the real data statistics better than LayoutGAN and LayoutVAE. (A sketch of how these statistics can be computed follows the table below.)
| Method | Coverage (Rico) | Overlap (Rico) | Coverage (PubLayNet) | Overlap (PubLayNet) |
| sg2im | 25.2 (46) | 16.5 (31) | 30.2 (26) | 3.4 (12) |
| ObjGAN | 39.2 (33) | 36.4 (29) | 38.9 (12) | 8.2 (7) |
| LayoutVAE | 41.5 (29) | 34.1 (27) | 40.1 (11) | 14.5 (11) |
| Ours | 33.6 (27) | 23.7 (33) | 47.0 (12) | 0.13 (1.5) |
| Real Data () | 30.2 (25) | 20.5 (30) | 47.8 (9) | 0.02 (0.5) |
| Real Data | 36.6 (27) | 22.4 (32) | 57.1 (10) | 0.1 (0.6) |
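Below is a minimal sketch of how the two statistics could be computed; the exact definitions used in the paper (e.g., normalization and rasterization resolution) are not specified, so this is our own approximation.

```python
# Sketch: coverage (fraction of canvas occupied) and overlap (mean pairwise IoU).
import numpy as np

def coverage(boxes, res=256):
    """boxes: list of (x0, y0, x1, y1) in [0, 1]; fraction of canvas covered."""
    canvas = np.zeros((res, res), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        canvas[int(y0 * res):int(y1 * res), int(x0 * res):int(x1 * res)] = True
    return canvas.mean()

def iou(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def overlap(boxes):
    pairs = [(a, b) for i, a in enumerate(boxes) for b in boxes[i + 1:]]
    return float(np.mean([iou(a, b) for a, b in pairs])) if pairs else 0.0
```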
We evaluate the importance of different model components and hyperparameters using negative log-likelihood (NLL) on COCO layouts. The ablation studies show the following:
Small, medium, and large elements: The NLL of our model for COCO large, medium, and small boxes is 2.4, 2.5, and 1.8, respectively. We observe that even though discretizing box coordinates introduces approximation errors, it allows our model to be agnostic to large vs. small objects.
Varying the grid size: The number of grid cells decides the resolution of the layout. Increasing it allows us to generate finer layouts, but at the expense of a model with more parameters. Also, as we increase the number of grid cells, the NLL increases, suggesting that we might need to train the model with more data to reach similar performance (Table 7).
Size of the embedding: Increasing the size of the embedding improves the NLL, but at the cost of an increased number of parameters (Table 7).
Model depth: Increasing the depth of the model does not significantly improve the results (Table 9). We therefore fix the depth in all our experiments.
Ordering of the elements: The self-attention layer in our model is invariant to the ordering of elements. Therefore, while predicting the next element of the layout, we do not consider the order in which the previous elements were added. However, in our experiments, we observed that predicting the elements in a simple raster scan order of their positions improves the model performance both visually and in terms of negative log-likelihood. This is intuitive, as filling in the elements in a pre-defined order is an easier problem. We leave the task of finding an optimal ordering of layout elements for future research (Table 9).
Discretization strategy: Instead of predicting the x-coordinate and y-coordinate of the next bounding box one at a time, we tried predicting them together as a single token (refer to the Split-xy column of Table 9). This increases the vocabulary size of the model (since we need $N^2$ possible joint locations instead of $N$ per coordinate) and, in turn, the number of parameters, with a decline in model performance. An upside of this approach is that generating new layouts takes less time, as we have to make half as many location predictions for each element of the layout (Table 9).
We propose LayoutTransformer, a novel approach for generating layouts of graphical elements. Our model uses self-attention to capture contextual relationships between different layout elements and to generate novel layouts. We show that our model improves upon previously proposed models for layout generation due to its ability to synthesize layouts from an empty set or to complete a partial layout. The model can also produce layouts with a variable number of elements and categories. We show that our model produces qualitatively better layouts than state-of-the-art approaches on diverse datasets such as Rico mobile app wireframes, COCO bounding boxes, and PubLayNet documents. While we demonstrated results by generating layouts in two dimensions, the framework is applicable to three-dimensional scenes as well. One limitation of the model is that while it can predict the size and location of objects in a scene, it cannot be easily extended to predict object masks. We will explore these directions in future work.
Acknowledgements. We thank Yuting Zhang, Luis Goncalves, Stefano Soatto and Guha Balakrishnan for many helpful discussions. This work was partially supported by DARPA via ARO contract number W911NF2020009.
In all our experiments, our base model is specified by the number of grid cells (corresponding to the discretized locations and sizes of bounding boxes), the dimension of the embeddings in each of the layers, the number of self-attention layers, the number of heads in each self-attention layer, and the number of units in the feedforward layer that follows the self-attention part in a single decoder block. We also use dropout at the end of each feedforward layer for regularization. Starting with the start token, our model predicts the category, location, and shape of the next bounding box in raster scan order. We fix the maximum number of elements per layout in each of the datasets to a value that covers over 99.9% of the layouts in each of the COCO, Rico, and PubLayNet datasets.
We use the Adam optimizer with Noam learning rate scheduling. We train our model for up to 20 epochs for each dataset with early stopping based on maximum log-likelihood on the validation layouts (overall, we trained for 8 epochs on COCO, 12 epochs on Rico, and 16 epochs on PubLayNet). Our COCO bounding boxes model takes about 2 hours to train on a single NVIDIA GTX 1080 GPU. Batching matters a lot for training speed: we want evenly divided batches with minimal padding, so we sort the layouts by the number of elements and search over this sorted list to find tight batches for training.
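A minimal sketch of the length-bucketed batching described above is given below; it is our own illustration (the padding token id and interfaces are assumptions), not the released data loader.

```python
# Sketch: sort layouts by length so each batch needs minimal padding (PAD = 0 assumed).
PAD = 0

def make_batches(sequences, batch_size):
    """sequences: list of token-id lists of varying length."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        max_len = max(len(sequences[i]) for i in idx)
        batches.append([sequences[i] + [PAD] * (max_len - len(sequences[i]))
                        for i in idx])
    return batches
```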
In this section, we provide more details on the qualitative and quantitative evaluations from Section 4. We also provide some additional results that did not fit in the main paper.
We show random samples generated for each dataset using the different methods listed in Section 4.2. For LayoutVAE, we use the label sets of layouts in the validation dataset as input to generate samples. For LayoutTransformer, we take one layout element from the validation dataset and complete the layout using our model.
While there is no standard method for comparing two layouts, in Section 4.3 of the paper we use a modified Chamfer distance (https://github.com/ThibaultGROUEIX/AtlasNet) to compare two layouts (with different numbers of elements). This metric does not take the categories of layout elements into account and simply computes Euclidean distances between bounding box descriptors (top-left coordinates, height, and width).
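A small sketch of such a Chamfer distance between two layouts is shown below, treating each element as a 4-dimensional box descriptor and ignoring categories as described above; the exact descriptor and normalization used in the paper are assumptions here.

```python
# Sketch: symmetric Chamfer distance between two sets of box descriptors.
import numpy as np

def chamfer(layout_a, layout_b):
    """layout_a: (n, 4), layout_b: (m, 4) arrays, e.g., (x0, y0, h, w) per element."""
    a, b = np.asarray(layout_a, float), np.asarray(layout_b, float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (n, m) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```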
In order to compute the NLL of layouts generated by LayoutVAE, we follow the approach provided by the authors in their paper. Using teacher forcing, for each layout in the validation set we compute a Monte Carlo estimate of the NLL by drawing 1000 samples from the conditional prior. We add the NLL of CountVAE and BBoxVAE.
Since it is straightforward to compute the likelihood of a layout with our method, we can use it to test whether a given layout is likely. Figure 12 shows the NLL given by our model after left-right and top-down inversion of layouts in COCO. If we flip a layout left-right, we observe that the layout remains likely; however, flipping the layout upside down decreases the likelihood (i.e., increases the NLL of the layout). This is intuitive, since it is unlikely to see fog at the bottom of an image or skis above a person.
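The inversion test can be reproduced with a few lines; the sketch below mirrors each box and compares sequence NLL under the trained model. `sequence_nll` is a hypothetical helper (flatten the layout and sum the per-token negative log-probabilities), not part of the released code.

```python
# Sketch: left-right and upside-down inversion of a layout (normalized coordinates).
def flip_lr(elements):
    # elements: list of (category, x, y, w, h) with (x, y) the top-left corner
    return [(c, 1.0 - x - w, y, w, h) for c, x, y, w, h in elements]

def flip_ud(elements):
    return [(c, x, 1.0 - y - h, w, h) for c, x, y, w, h in elements]

# Expected behavior (Figure 12): NLL(flip_lr) stays close to NLL(original),
# while NLL(flip_ud) increases noticeably.
# delta_ud = sequence_nll(model, flip_ud(layout)) - sequence_nll(model, layout)
```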
LayoutGAN represents each layout with a fixed number of bounding boxes. Starting with bounding box coordinates sampled from a Gaussian distribution, its GAN-based framework assigns new coordinates to each bounding box to resemble the layouts in the given data. Optionally, it uses non-maximum suppression (NMS) to remove duplicates. The problem setup in LayoutGAN is similar to our proposed approach in that the generated layout is not conditioned on anything. Like many GAN setups, LayoutGAN is non-trivial to train across multiple datasets; in our implementation, we were unable to prevent mode collapse on all datasets except MNIST.
In this section, we share statistics of the different elements and their categories in each dataset. In particular, we report the total number of occurrences of each element category in the training data (in descending order) and the total number of distinct layouts the category appears in across the training data. Table 12 shows these statistics for COCO bounding boxes, Table 10 shows the statistics for Rico wireframes, and Table 11 shows the statistics for PubLayNet documents.