Transform the Set: Memory Attentive Generation of Guided and Unguided Image Collages

by Nikolay Jetchev, et al.

Cutting and pasting image segments feels intuitive: the choice of source templates gives artists flexibility in recombining existing source material. Formally, this process takes an image set as input and outputs a collage of the set elements. Such selection from sets of source templates does not fit easily in classical convolutional neural models requiring inputs of fixed size. Inspired by advances in attention and set-input machine learning, we present a novel architecture that can generate in one forward pass image collages of source templates using set-structured representations. This paper has the following contributions: (i) a novel framework for image generation called Memory Attentive Generation of Image Collages (MAGIC) which gives artists new ways to create digital collages; (ii) from the machine-learning perspective, we show a novel Generative Adversarial Networks (GAN) architecture that uses Set-Transformer layers and set-pooling to blend sets of random image samples - a hybrid non-parametric approach.





The authors would like to give special thanks to Kashif Rasul, whose PyTorch expertise greatly aided the project, and to Roland Vollgraf for the many useful generative model discussions.

Ethical Considerations

The MAGIC algorithm we presented here is a tool for digital collages. It allows artist practitioners to experiment and iterate faster when exploring various artistic choices, e.g. choice of source material memory templates as style or which content image to stylize. While the new tool amortizes the costs of the collage process – faster than either manually cutting paper or using traditional optimization based tools – this tool is inherently meant to be a part of a collaboration between artist and AI machine. In that sense, we would say that our tool is ethical and does not risk radical negative disruption of the artistic landscape (which is a risk when tools completely automate and replace processes). MAGIC is rather a subtle evolution towards finer and faster intelligent control of specific steps involved in the artistic process for a specific artform – collages and mosaics. Such a tool empowers artists to explore and create more interesting artworks.


Appendix I: Architecture and loss details

Figure 3: The generator G takes as input a random set of K independently sampled images, the template memory. The architecture of G has three components. (i) A U-Net with ST blocks (see Figure 4) reasons about interactions between set elements and outputs the blending weights. (ii) The permutation-invariant pooling operation creates the collage image as a convex combination over the set elements, using softmax on the blending weights. (iii) A purely convolutional U-Net refines the collage. The discriminator D distinguishes true patches from generated patches.
Figure 4: The structure of the blocks of the U-Nets used inside G. The input is the memory tensor, which holds channel feature maps for each set element. The set-transformer layer operates on one set per spatial position, each of K elements taken from the same spatial position inside the memory tensor. Each slice across the set-element dimension is a set.
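The grouping described in the Figure 4 caption can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the nested-list layout `memory[k][c][y][x]`, the helper names and the parameter-free dot-product attention are all assumptions made for the example (a real Set Transformer layer has learned projections and would operate on PyTorch tensors).

```python
import math

def sets_per_position(memory):
    """Regroup a K x C x H x W nested list into H*W sets of K feature vectors."""
    K, C = len(memory), len(memory[0])
    H, W = len(memory[0][0]), len(memory[0][0][0])
    return [
        [[memory[k][c][y][x] for c in range(C)] for k in range(K)]
        for y in range(H) for x in range(W)
    ]

def self_attention(elements):
    """Parameter-free dot-product self-attention over one set of vectors.

    Assumption: queries, keys and values are the raw features themselves.
    """
    scale = math.sqrt(len(elements[0]))
    out = []
    for q in elements:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in elements]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([
            sum(e / total * v[c] for e, v in zip(exps, elements))
            for c in range(len(q))
        ])
    return out

# Toy memory tensor: K=3 set elements, C=2 channels, 2x2 spatial grid.
memory = [[[[0.1 * (k + 1) * (c + 1) * (y + 2 * x + 1) for x in range(2)]
            for y in range(2)] for c in range(2)] for k in range(3)]
sets = sets_per_position(memory)            # one set per spatial position
attended = [self_attention(s) for s in sets]
```

Reordering the elements of a set reorders the attention outputs in the same way, which is the permutation-equivariance property the ST blocks rely on.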

The structure of the generator G is shown in Figure 3. The following tensors are used for the generator (for simplicity, we skip the minibatch index and give tensor sizes for a minibatch with a single instance):

  • all patches share the same spatial dimensions during training.

  • without loss of generality, the generator can be applied to spatially larger input tensors, since all convolutional and ST layers can handle such a size adjustment.

  • let K be the number of memory template set elements.

  • the input is the sampled memory set of K images.

  • the output of the first ST-U-Net is a set of blending weights, one map per set element. Applying softmax over the set elements ensures that, at every spatial position, the weights are non-negative and sum to 1, i.e. they are convex combination weights.

  • the collage is the convex combination of the memory set elements with the blending weights. This is a form of permutation-invariant set-pooling, since the output does not change if we permute the memory set and the weights consistently in the first dimension, the one that indexes the set elements.

  • the second U-Net takes the collage as input and outputs the refined image, using the parametric architecture of stacked convolutional blocks to correct the collage and make it more perceptually plausible.

The generator G takes as input randomly sampled memory sets with K elements (optionally combined with content guidance patches). This is fed to a U-Net (with skip connections as in [4]). However, in addition to standard convolutional blocks, we also add blocks that can process the set structure of the data using Set Transformer (ST) layers; we call this network architecture ST-U-Net. It outputs mixing coefficients and processes the set elements permutation-equivariantly (reordering the set elements reorders the outputs in the same way) while also accounting for the interactions between them. See Figure 4 for an illustration of how exactly the set operation is used. Note that the usual convolutional block is permutation-equivariant by definition, since it works on each element independently. Afterwards, by applying softmax on the mixing coefficients, we can calculate a convex combination of the memory set elements and produce the collage image with permutation-invariant set pooling.
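The softmax and set-pooling step can be illustrated with a small standard-library sketch. It assumes grayscale templates stored as `memory[k][y][x]` and raw mixing coefficients stored as `logits[k][y][x]`; these layouts and the function names are choices made for the example, not the paper's code.

```python
import math

def softmax_over_set(logits):
    """Softmax across the K set elements, independently at every pixel."""
    K, H, W = len(logits), len(logits[0]), len(logits[0][0])
    weights = [[[0.0] * W for _ in range(H)] for _ in range(K)]
    for y in range(H):
        for x in range(W):
            top = max(logits[k][y][x] for k in range(K))
            exps = [math.exp(logits[k][y][x] - top) for k in range(K)]
            total = sum(exps)
            for k in range(K):
                weights[k][y][x] = exps[k] / total
    return weights

def set_pool(memory, weights):
    """Collage = per-pixel convex combination of the K memory images."""
    K, H, W = len(memory), len(memory[0]), len(memory[0][0])
    return [[sum(weights[k][y][x] * memory[k][y][x] for k in range(K))
             for x in range(W)] for y in range(H)]

# K=3 grayscale 2x2 "templates" and matching mixing coefficients.
memory = [[[0.0, 0.2], [0.4, 0.6]],
          [[1.0, 0.8], [0.6, 0.4]],
          [[0.5, 0.5], [0.5, 0.5]]]
logits = [[[2.0, 0.0], [0.0, -1.0]],
          [[0.0, 2.0], [0.0, 1.0]],
          [[-1.0, 0.0], [3.0, 0.0]]]
weights = softmax_over_set(logits)
collage = set_pool(memory, weights)
```

Because the pooled value is a sum over set elements, shuffling the templates together with their coefficients leaves the collage unchanged, and every collage pixel stays within the range spanned by the templates at that position.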

In addition, we can optionally apply spatial warping to each image of the memory set, using a parametrization like the Spatial Transformer [5] or directly a full optical flow. For this purpose, we predict deformation parameters for each set element. We calculate these parameters as an output of the ST-U-Net, together with the mixing coefficients, and apply the warping deformation before the set-pooling.

The discriminator D uses a classical patchGAN [12] approach: it discriminates the sampled training patches from the generated image patches. The overall loss for training MAGIC (and finding a good generator G) combines adversarial and content guidance terms (for the content-guidance use case, see [4, 6]).


Appendix II: Finer collage control by imposing additional GAN generator constraints

Figure 5: The effect of entropy regularization of the blending coefficients. A large regularization weight leads to solutions with coefficient values close to 0 or 1: low-entropy, "hard attention" sparse solutions. For example, the top row shows how the hat and the left and right face parts are copied from 3 different memory templates. Conversely, a weaker regularization weight does not constrain the generator to predict low-entropy coefficients. It lets the blending weights soften, which leads to less sparse collages that softly blend the whole memory images. Example using the CelebA dataset at 256 pixels resolution.

The outputs of the set-transformer U-Net are the convex combination weights. These determine how the memory templates are blended (a form of set-pooling). By constraining the generator to output weights with different statistics, we gain a way to enforce different artistic choices for the collage generation. We experimented with three additional constraints that influence the generator G:

  • the entropy determines how sparse the convex combination weights are. Low entropy means that, in a given spatial region, one memory template is fully copied with weight 1 and the others are left out with weight 0. Conversely, high entropy gives softer results, blending different templates more gently with weights strictly between 0 and 1. See Figure 5 for an example of how this changes the face blending for unsupervised MAGIC.

  • total variation determines whether we have bigger segments with small borders, or many small segments that vary spatially. In Figure 6 we show the effects of such regularization, using a large guided mosaic collage as an example.

  • It is desirable to have a collage that uses a varied selection of the memory template elements. To achieve this, we propose a loss term that penalizes the spatial size of the largest memory template over the whole spatial region. This term is required especially for the unguided case, where a trivial solution would be to give a single memory element weight 1 everywhere spatially and copy it completely. This would fool the discriminator perfectly, but is a failure mode for the collage purpose. Our design of this term prevents that case easily.

By tuning the scalar weight of each regularization loss term, we control its contribution to the total loss, alongside the adversarial and content terms defined in the previous section.
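The three constraints can be made concrete with a hedged sketch. The exact formulas are not given in this text, so the definitions below are plausible assumptions rather than the authors' terms: mean per-pixel entropy, anisotropic total variation, and the largest share of the image area claimed by one template. The function names and the weights `w_entropy`, `w_tv` and `w_largest` are hypothetical.

```python
import math

def entropy_term(weights):
    """Mean per-pixel entropy of K x H x W blending weights."""
    K, H, W = len(weights), len(weights[0]), len(weights[0][0])
    total = 0.0
    for y in range(H):
        for x in range(W):
            for k in range(K):
                w = weights[k][y][x]
                if w > 0.0:  # 0 * log(0) is taken as 0
                    total -= w * math.log(w)
    return total / (H * W)

def total_variation_term(weights):
    """Mean absolute difference between horizontally/vertically adjacent weights."""
    K, H, W = len(weights), len(weights[0]), len(weights[0][0])
    total = 0.0
    for k in range(K):
        for y in range(H):
            for x in range(W):
                if x + 1 < W:
                    total += abs(weights[k][y][x + 1] - weights[k][y][x])
                if y + 1 < H:
                    total += abs(weights[k][y + 1][x] - weights[k][y][x])
    return total / (K * H * W)

def largest_template_term(weights):
    """Largest fraction of the image area assigned to a single template."""
    K, H, W = len(weights), len(weights[0]), len(weights[0][0])
    return max(sum(sum(row) for row in weights[k]) for k in range(K)) / (H * W)

def regularization_loss(weights, w_entropy, w_tv, w_largest):
    """Weighted sum of the three penalties (hypothetical scalar weights)."""
    return (w_entropy * entropy_term(weights)
            + w_tv * total_variation_term(weights)
            + w_largest * largest_template_term(weights))

# K=2 templates on a 2x2 grid: copy one template, blend evenly, or alternate.
copy_one = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
uniform = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
checker = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
```

Under these assumed definitions, copying a single template has zero entropy and zero total variation but the maximal largest-template penalty, which is why that third term is what rules out the trivial unguided solution.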

Figure 6: Example of a guided collage of a human portrait as content (size 1200x1600 pixels), with Berlin city fragments used to sample a set of 50 memory templates. We illustrate the effect of TV regularization of the blending coefficients on the look of the generated collages. A small regularization weight does not constrain the generator to output smooth blending coefficients. It leads to smaller segment cuts that copy finer spatial details and paint the content more accurately (top row). Conversely, a large regularization weight leads to solutions with small total variation, implying bigger segments with smoother borders (bottom row).

Appendix III: Detailed comparison with Fully Adversarial Mosaics (FAMOS)

Figure 7: Illustration of three adversarial guided mosaic stylization approaches, trained on Milan city images as style templates and an Arcimboldo portrait of size 1200x1800 pixels as content. (i) A fully convolutional U-Net generator. (ii) A hybrid method that blends from fixed templates and refines them with another U-Net. (iii) MAGIC collage using the same data. Arguably, MAGIC has the finest control over source material recombination; see the text for details.

We can compare MAGIC, the method presented in this paper, with FAMOS [8], another approach proposing a hybrid combination of non-parametric patch copying and fully adversarial end-to-end learning. The novel ST architecture of MAGIC improves image generation quality and convergence, allowing more flexible cutting of regions of interest from memory templates. Using the set structure allows MAGIC to generalize flexibly to randomly sampled sets, unlike FAMOS, where a predefined ordered tensor was required and patch coordinates were "memorized" explicitly. We compare FAMOS and MAGIC visually on the guided mosaic task, which both models support. We visualize the results of FAMOS, in both fully parametric and hybrid memory copying mode, and of MAGIC, our novel method. For training we used patches cropped from 4 Milan city satellite images as the template distribution, and an Arcimboldo portrait of size 1200x1800 pixels as the content image.

  • The fully parametric convolutional approach (top row) is smooth, but the city image lacks visual details and has some distortion – the training distribution of Milan city maps is not accurately learned.

  • The hybrid memory template copying and refining approach (middle row) shows a different mosaic result. The memory templates available were fixed for the whole training procedure. A few of them were copied as background, and the convolutional layers then added more details on top, mainly depicting the content image more accurately. However, this collage has a rather rough structure: overly large segments are copied from the memory templates, see the plot of the mixing coefficients at the bottom right. While such large memory segment cuts have a certain visual charm, they reflect imprecise control over exact patch cutting and placement for the collage.

  • For comparison, see the respective MAGIC results (bottom row), with memory templates randomly sampled from a whole distribution. The cut and pasted segments are much smaller, and MAGIC controls much better what is copied and pasted where. In addition, MAGIC is not tied to the set size used in training: it can work with a different number of set elements and sample them flexibly, an advantage of its set transformer generator architecture.

While all 3 tested methods can produce beautiful mosaics with good stylization and content properties, the aesthetic quality of a mosaic is a subjective estimate of the artist or audience. We think that the fine control that MAGIC offers over the placement and cutting of memory templates makes it a worthwhile addition to an AI artist toolbox.

Appendix IV: Technical details

We implemented our code using PyTorch 1.0, and ran experiments on a Tesla V100 GPU. Each convolutional block had a convolution with a 5x5 kernel, instance normalization and a ReLU nonlinearity. As is typical for U-Nets [4], we use downsampling and upsampling to form an hourglass shape. Channels were 48 at the largest spatial resolution and doubled when the spatial resolution was halved. For the discriminator, we could use many more channels, 128 at the first layer and doubling after every layer. We also used FP16 precision in order to get lower memory costs and fit larger set sizes; note that the complexity of the ST block is quadratic in the set size K. The U-Nets we used had skip connections.
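The channel schedule above is simple arithmetic, shown here for the network depths reported for the guided experiment. Reading "depth" as the number of resolution levels (or discriminator layers) is an assumption, and `doubling_channels` is a hypothetical helper name.

```python
def doubling_channels(base, depth):
    """Channel count per level when the count doubles at every level."""
    return [base * 2 ** level for level in range(depth)]

# U-Net: 48 channels at the largest resolution, doubled at each halving.
unet_channels = doubling_channels(48, 5)
# Discriminator: 128 channels at the first layer, doubled after every layer.
disc_channels = doubling_channels(128, 6)
# Spatial size of a 256-pixel training patch at each U-Net level.
unet_sizes = [256 // 2 ** level for level in range(5)]
```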

For training we used the usual cross-entropy GAN loss [15] with minibatches, so the sampled memory templates were effectively 5-dimensional tensors, with one dimension each for the minibatch, the set elements, the channels and the two spatial axes. Since we use instance normalisation (and not batch normalisation), the batch size can be chosen flexibly depending on the GPU memory constraints. We trained on 3 V100 GPUs.
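For reference, a cross-entropy GAN loss over discriminator patch logits can be sketched as follows. The text does not say which generator variant was used, so the non-saturating form below is an assumption, as are the function names and the flat list of per-patch logits.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def discriminator_loss(real_logits, fake_logits):
    """Binary cross-entropy: real patches labelled 1, generated patches 0."""
    # -log(sigmoid(z)) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z)
    real_term = sum(softplus(-z) for z in real_logits) / len(real_logits)
    fake_term = sum(softplus(z) for z in fake_logits) / len(fake_logits)
    return real_term + fake_term

def generator_loss(fake_logits):
    """Non-saturating generator loss (assumed variant): make fakes look real."""
    return sum(softplus(-z) for z in fake_logits) / len(fake_logits)

# An undecided discriminator (all logits 0) versus a confident, correct one.
undecided = discriminator_loss([0.0, 0.0], [0.0, 0.0])
confident = discriminator_loss([5.0, 5.0], [-5.0, -5.0])
```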

For the two experiments shown in Figures 1 and 2 we used the following settings:

  • guided generation (Fig. 1): training data distribution for memory templates: randomly cropped 256x256 pixel patches from 6 Berlin city map images (each of resolution 1800x900 pixels). K=50 memory templates, U-Nets of depth 5, discriminator depth 6. For inference of the large output mosaic we can unroll the generator on any size (for posters, typically many megapixels), and decide how to split the rendering spatially given the system GPU memory constraints.

  • unguided generation (Fig. 2): image size 256x256 pixels, same size for training and inference. U-Nets of depth 5, discriminator depth 7. The element count of the memory set was sampled randomly at each minibatch iteration.
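One way to split a large rendering spatially is to cover the output with fixed-size tiles. This is only a sketch under stated assumptions: the helper `tile_boxes`, the 512-pixel tile size and the optional overlap are hypothetical and not taken from the paper, and a real pipeline would also need to blend tile borders, since convolutional outputs differ near tile edges.

```python
def tile_boxes(height, width, tile, overlap=0):
    """Return (top, left, bottom, right) boxes covering a height x width image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            boxes.append((top, left,
                          min(top + tile, height), min(left + tile, width)))
    return boxes

# A 1200x1600 pixel portrait (width x height) rendered in 512-pixel tiles.
boxes = tile_boxes(height=1600, width=1200, tile=512)
```

Each box can then be rendered separately, with the tile size chosen to fit the available GPU memory.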

We used the standard ADAM [10] optimizer settings as in [15]. In general, training a MAGIC model is quite fast, orders of magnitude faster than properly training a comparable classical parametric GAN until convergence: around 15 minutes for the guided example and 1 hour for the unguided example. Such quick adaptation of MAGIC to a new dataset, enough to sample convincing generated collages, is yet another advantage for artistic exploration.

However, we note that integrating warping with blending makes the training of MAGIC more difficult. Depending on the deformation model (optical flow, or a spatial transformer with various degrees of freedom), more training iterations may be required. Warping is not strictly necessary, though, for the guided case, or for the unguided case when the training data patches are well aligned.